Policy Gradient Methods

Policy Gradient Tutorial

Introduction

Policy Gradient methods are a family of Reinforcement Learning algorithms that optimize the policy directly, rather than estimating value functions as Q-Learning or DQN do. They use gradient ascent to adjust the parameters of a policy network so as to maximize the expected return.

In Moroccan Darija: Policy Gradient methods learn the policy directly, without computing Q-values. They use neural networks to gradually adjust the policy parameters so that the expected rewards increase.

Core Concepts Explained

  • Policy (π): A function that maps states to probabilities of actions.
  • Objective Function (J(θ)): The goal is to maximize the expected cumulative reward under the policy parameters θ.
  • Gradient Ascent: Instead of minimizing loss, we maximize performance by moving parameters in the direction of the gradient.
  • Monte Carlo Estimation: Used to estimate returns from sampled episodes (a short return-estimation sketch follows this section).

In Moroccan Darija: The policy decides which action the agent takes in each state. The goal is to improve the policy so that we earn as much as possible on average. We use the gradient to increase the expected rewards.
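
To make the Monte Carlo return estimate concrete, here is a minimal Python sketch; the function name discounted_returns and the toy reward list are purely illustrative, not part of any library.

def discounted_returns(rewards, gamma=0.99):
    # Work backwards so that each return reuses the one after it: G_t = r_t + gamma * G_{t+1}
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    return returns

# Three rewards of 1.0 give returns [1 + 0.99 + 0.99**2, 1 + 0.99, 1.0]
print(discounted_returns([1.0, 1.0, 1.0]))

The REINFORCE example later in this tutorial performs exactly this backward accumulation inside its training loop.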

Mathematical Formula

The main idea of Policy Gradient is to adjust the parameters θ in the direction that increases performance:


∇θ J(θ) = E[ ∇θ log πθ(at | st) · Gt ]

Here, Gt is the total discounted return from time step t onward, and log πθ(at | st) is the logarithm of the probability the policy assigns to the chosen action. The expectation is taken over trajectories sampled by following the current policy.
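
Where does the log come from? A brief sketch of the standard derivation (the log-derivative trick), written in the same notation as above, with τ denoting a sampled trajectory, R(τ) its total reward, and P(τ; θ) the probability of that trajectory under the current policy:

∇θ J(θ) = ∇θ E[ R(τ) ]
        = E[ R(τ) · ∇θ log P(τ; θ) ]          (since ∇θ P = P · ∇θ log P)
        = E[ Σt ∇θ log πθ(at | st) · Gt ]      (the dynamics do not depend on θ, and rewards earned before step t do not depend on at)

In practice the expectation is approximated by averaging over sampled episodes, which is exactly what the REINFORCE algorithm below does.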

Python Example: Simple REINFORCE Algorithm (Policy Gradient)


import gym
import torch
import torch.nn as nn
import torch.optim as optim

# Define Policy Network
class PolicyNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, 128)
        self.fc2 = nn.Linear(128, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return torch.softmax(self.fc2(x), dim=-1)

# Initialize environment and policy
env = gym.make("CartPole-v1")
policy = PolicyNetwork(env.observation_space.shape[0], env.action_space.n)
optimizer = optim.Adam(policy.parameters(), lr=0.01)
gamma = 0.99

def select_action(state):
    state = torch.from_numpy(state).float()
    probs = policy(state)
    action = torch.multinomial(probs, 1).item()
    return action, torch.log(probs[action])

# Training loop
for episode in range(500):
    log_probs = []
    rewards = []
    state, _ = env.reset()  # reset() returns (observation, info) in recent gym versions
    done = False
    while not done:
        action, log_prob = select_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated  # stop on termination or time-limit truncation
        log_probs.append(log_prob)
        rewards.append(reward)
        state = next_state
    # Compute returns
    returns = []
    G = 0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-9)
    # Policy update
    loss = []
    for log_prob, Gt in zip(log_probs, returns):
        loss.append(-log_prob * Gt)
    loss = torch.stack(loss).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if episode % 50 == 0:
        print(f"Episode {episode}, Total Reward: {sum(rewards)}")

Explanation of the Example

  • The policy network outputs probabilities for each action.
  • The agent samples actions according to the policy.
  • Rewards are collected during the episode.
  • The total discounted return is computed for each step.
  • The network parameters are updated using gradient ascent on the log probabilities weighted by returns.

In Moroccan Darija: The neural network outputs the probability of each action. The agent tries actions and collects rewards. Afterwards, the total discounted return is computed and the network is adjusted so that the good actions become more likely.

Advantages of Policy Gradient

  • Works well in continuous action spaces (a Gaussian-policy sketch follows this list).
  • Can learn stochastic policies.
  • Updates the policy smoothly with the parameters, which can help convergence on some high-dimensional problems.
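
To illustrate the continuous-action case, here is a minimal sketch of a stochastic Gaussian policy head. The class name GaussianPolicy and the sizes state_size=3, action_size=1 (roughly a Pendulum-style task) are illustrative assumptions, not part of the tutorial code above.

import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    # Stochastic policy for continuous actions: the network outputs the mean of a
    # Normal distribution, with a learned state-independent standard deviation.
    def __init__(self, state_size, action_size):
        super().__init__()
        self.fc = nn.Linear(state_size, 128)
        self.mu = nn.Linear(128, action_size)
        self.log_std = nn.Parameter(torch.zeros(action_size))

    def forward(self, state):
        h = torch.relu(self.fc(state))
        return self.mu(h), self.log_std.exp()

cont_policy = GaussianPolicy(state_size=3, action_size=1)   # illustrative sizes
mean, std = cont_policy(torch.randn(3))
dist = Normal(mean, std)
action = dist.sample()                                      # sampling keeps the policy stochastic
log_prob = dist.log_prob(action).sum()

The sampled action's log_prob plugs into the same "-log_prob * return" update used in the discrete example, which is why Policy Gradient extends naturally to continuous actions.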

Limitations

  • High variance in gradient estimates (a baseline remedy is sketched after this section).
  • Requires many episodes to converge.
  • Performance depends on good hyperparameter tuning.

In Moroccan Darija: Among the advantages of Policy Gradient is that it works well with continuous actions and can learn stochastic policies. However, it suffers from high variance and needs many trials to reach a good solution.
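
The high-variance issue is usually the first thing addressed in practice. One standard remedy is to subtract a baseline from the returns; the return normalization in the REINFORCE code above is a crude version of this. Below is a minimal sketch of a learned state-value baseline; the names value_net and policy_loss_with_baseline are hypothetical, and state_dim=4 assumes CartPole-v1 observations.

import torch
import torch.nn as nn
import torch.optim as optim

state_dim = 4   # assumed CartPole-v1 observation size
value_net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, 1))
value_optimizer = optim.Adam(value_net.parameters(), lr=0.01)

def policy_loss_with_baseline(states, returns, log_probs):
    # states: (T, state_dim) tensor of visited states, returns: (T,) tensor of G_t,
    # log_probs: list of T scalar log-probability tensors from the policy.
    values = value_net(states).squeeze(-1)
    advantages = returns - values.detach()          # subtracting V(s) does not bias the gradient
    policy_loss = -(torch.stack(log_probs) * advantages).sum()
    value_loss = (returns - values).pow(2).mean()   # regress the baseline toward the returns
    value_optimizer.zero_grad()
    value_loss.backward()
    value_optimizer.step()
    return policy_loss                              # backpropagate and step the policy as before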

Advanced Extensions

  • Actor-Critic: Combines Policy Gradient with a learned value estimator (the critic) to reduce the variance of the updates.
  • Proximal Policy Optimization (PPO): Improves training stability by clipping the policy-update ratio (a sketch of the clipped loss follows this section).
  • Trust Region Policy Optimization (TRPO): Constrains each policy update with a KL-divergence trust region so the policy cannot change too much in a single step.

In Moroccan Darija: There are more advanced variants such as Actor-Critic, which combines the policy with a value estimate, and PPO and TRPO, which give more stable training.
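
For context, here is a minimal sketch of the clipped surrogate loss used by PPO; the function name ppo_clipped_loss and the dummy tensors are illustrative, and this is not a full PPO implementation.

import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Probability ratio between the updated policy and the policy that collected the data.
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Clipping the ratio to [1 - clip_eps, 1 + clip_eps] removes the incentive to move
    # the new policy far from the old one, which is what stabilizes training.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Example call with dummy tensors:
loss = ppo_clipped_loss(torch.randn(5), torch.randn(5), torch.randn(5))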

10 Exercises for Practice

  1. Define the goal of Policy Gradient methods.
  2. Write down the formula for the policy gradient update.
  3. Explain the difference between Q-Learning and Policy Gradient.
  4. Implement REINFORCE in any simple Gym environment.
  5. Experiment with different learning rates and observe performance.
  6. Normalize returns and compare results with unnormalized ones.
  7. Add a baseline to reduce variance in gradient estimates.
  8. Implement Actor-Critic and compare with pure Policy Gradient.
  9. Plot total rewards across episodes to visualize learning.
  10. Explain why Policy Gradient is effective in continuous action spaces.