Introduction
Q-Learning is a model-free Reinforcement Learning algorithm. It allows an agent to learn the best actions to take in an environment without knowing the transition probabilities. The goal is to learn the optimal action-value function, called the Q-function.
In short: Q-Learning is one of the best-known reinforcement learning methods. It teaches the agent to make good decisions without knowing the rules of the environment, and its goal is to learn the value of every action in every state so that the best one can be chosen.
Core Concepts Explained
- Q-Value (Q(s, a)): The expected total reward from taking action a in state s and following the optimal policy afterwards.
- Learning Rate (α): Controls how quickly the algorithm updates its knowledge.
- Discount Factor (γ): Determines how much the agent values future rewards relative to immediate ones (see the short sketch after this list).
- Exploration vs Exploitation: The agent must balance between exploring new actions and exploiting the best-known actions.
In short: the Q-value Q(s, a) tells us how much the agent expects to gain if it takes action a in state s. The parameters α (learning speed) and γ (importance of the future) shape the updates, and the agent has to balance trying new things against using what it has already learned.
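To make γ concrete, the short sketch below computes a discounted return for an arbitrary reward sequence (the numbers are purely illustrative and not taken from the example later in this article):

gamma = 0.9
future_rewards = [0, 0, 1, 10]  # rewards received over the next four steps (illustrative)
discounted_return = sum(gamma**t * r for t, r in enumerate(future_rewards))
print(discounted_return)  # ≈ 8.1: later rewards count for less than immediate ones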
Q-Learning Equation
The core update rule of Q-Learning is:
Q(s, a) ← Q(s, a) + α * [R + γ * max_a' Q(s', a') - Q(s, a)]
This formula updates the Q-value based on the immediate reward and the estimated future reward.
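The same rule can be written as a small helper function. The sketch below is a direct transcription of the formula for a tabular Q (the helper name q_update is just for illustration):

import numpy as np

def q_update(Q, state, action, reward, next_state, alpha, gamma):
    # Target: immediate reward plus discounted value of the best next action
    target = reward + gamma * np.max(Q[next_state, :])
    # Move the current estimate a step of size alpha toward the target
    Q[state, action] += alpha * (target - Q[state, action])
    return Q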
Python Example: Simple Q-Learning
import numpy as np
import random
# Define environment
states = [0, 1, 2, 3, 4, 5]
actions = [0, 1] # 0 = left, 1 = right
rewards = [0, 0, 0, 0, 1, 10] # small reward at state 4, goal reward at state 5
# Initialize Q-table
Q = np.zeros((len(states), len(actions)))
# Parameters
alpha = 0.1
gamma = 0.9
epsilon = 0.1 # exploration rate
episodes = 200
for episode in range(episodes):
    state = 0
    while state != 5:  # run each episode until the goal state is reached
        # epsilon-greedy action selection
        if random.uniform(0, 1) < epsilon:
            action = random.choice(actions)        # explore: random action
        else:
            action = int(np.argmax(Q[state, :]))   # exploit: best known action
        # deterministic transition: right moves forward, left moves back (bounded at state 0)
        next_state = state + 1 if action == 1 else max(state - 1, 0)
        reward = rewards[next_state]
        # Q-Learning update rule
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state, :]) - Q[state, action])
        state = next_state
print("Learned Q-table:")
print(Q)
Explanation of the Example
- The environment is a 1D path of 6 states.
- The agent can move left or right.
- Rewards are given only at the last state.
- The Q-table is updated with the Q-Learning update rule.
- Over episodes, the agent learns which actions reach the goal faster (a quick policy check follows below).
In short: the agent learns to walk from state 0 to state 5. At first it explores a lot; later it starts to recognize which actions get it to the reward quickly, and the Q matrix fills up with values over time.
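Once training has finished, the greedy policy can be read directly from the Q-table. A quick check, assuming the Q array from the example above is still in scope:

greedy_actions = np.argmax(Q, axis=1)  # best action index for each state
print("Greedy action per state:", greedy_actions)  # expected to favour 1 (right) along the path to state 5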
Visualization of Learning Process (Optional)
After enough episodes, the Q-table should show higher values for actions that move the agent toward the goal.
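One simple way to see this is a heatmap of the Q-table. A minimal matplotlib sketch, assuming matplotlib is installed and the Q array from the example is available:

import matplotlib.pyplot as plt

plt.imshow(Q, cmap="viridis", aspect="auto")  # one row per state, one column per action
plt.colorbar(label="Q-value")
plt.xticks([0, 1], ["left", "right"])
plt.xlabel("Action")
plt.ylabel("State")
plt.title("Learned Q-values")
plt.show()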
10 Exercises for Practice
- Define what Q-Learning is and its main goal.
- Explain the difference between Q-Learning and Value Iteration.
- Write the Q-Learning update rule in your own words.
- Implement Q-Learning for a grid environment (3x3 states).
- Change the learning rate α and observe its effect.
- Modify the exploration rate ε and describe what happens.
- Plot the total reward per episode using matplotlib.
- Implement a greedy policy from a learned Q-table.
- Explain how Q-Learning can work in stochastic environments.
- Extend Q-Learning to continuous states using function approximation.
Advanced Topic: ε-Greedy Policy
In Q-Learning, the ε-greedy policy is used to balance exploration and exploitation. With probability ε, the agent explores (chooses random actions). With probability (1 - ε), it exploits the best known action.
In short: the ε-greedy policy lets the agent try new actions a small fraction of the time (ε) and stick with the actions it trusts the rest of the time (1 - ε). This way it can learn well without getting stuck in incomplete solutions.
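As a standalone helper, ε-greedy action selection might look like the sketch below (the function name epsilon_greedy is illustrative, not part of any library):

import random
import numpy as np

def epsilon_greedy(Q, state, epsilon):
    # Explore: pick a uniformly random action with probability epsilon
    if random.uniform(0, 1) < epsilon:
        return random.randrange(Q.shape[1])
    # Exploit: pick the action with the highest estimated value
    return int(np.argmax(Q[state, :]))

# Example call with the Q-table from the article's example:
# action = epsilon_greedy(Q, state=0, epsilon=0.1)

A common refinement is to decay ε over episodes, so the agent explores heavily at the start and relies more on its learned Q-values later.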