Introduction
Markov Decision Processes, or MDPs, are the foundation of Reinforcement Learning. They describe environments where an agent makes decisions, receives rewards, and transitions between states. MDPs help machines learn how to act optimally in uncertain situations.
In short: Markov Decision Processes are the basis of reinforcement learning. They describe an environment in which an agent makes decisions, receives rewards, and moves between states; the goal is for it to learn how to behave optimally under uncertainty.
Core Concepts Explained
An MDP is defined by four main components:
- States (S): All possible situations the agent can be in.
- Actions (A): All possible moves the agent can take.
- Transition Function (P): Probability of moving from one state to another after taking an action.
- Reward Function (R): The immediate reward received after an action.
Formally, an MDP is a tuple (S, A, P, R, γ) where γ (gamma) is the discount factor, controlling how much the agent values future rewards.
In short: an MDP has four main elements: states, actions, transition probabilities, and rewards. Each one plays a role in teaching the agent what to do to reach the best possible outcome.
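To make the tuple (S, A, P, R, γ) concrete, here is a minimal sketch of how the five components can be written down as plain Python data structures. The specific states, probabilities, and rewards below are invented for illustration and are not the environment used later in this article:

S = [0, 1, 2]        # States: all situations the agent can be in
A = [0, 1]           # Actions: 0 = Left, 1 = Right
gamma = 0.9          # Discount factor: how much future rewards count

# Transition function P(s' | s, a) as nested dicts of probabilities (made-up numbers)
P_def = {
    0: {0: {0: 1.0}, 1: {1: 0.8, 0: 0.2}},
    1: {0: {0: 1.0}, 1: {2: 1.0}},
    2: {0: {2: 1.0}, 1: {2: 1.0}},
}

# Reward function R(s, a): immediate reward for taking action a in state s (made-up numbers)
R_def = {
    0: {0: 0.0, 1: 0.0},
    1: {0: 0.0, 1: 1.0},
    2: {0: 0.0, 1: 0.0},
}

The Value Iteration example below packs P and R together into a single table of (probability, next_state, reward, done) tuples, which is just a more compact way of storing the same information.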
Mathematical Representation
The goal in an MDP is to find a policy π(s) that maximizes the expected return:
V*(s) = max_a [ R(s, a) + γ * Σ_{s'} P(s' | s, a) * V*(s') ]
This is called the Bellman Optimality Equation.
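As a quick illustration of what one application of this equation computes, here is a tiny hand-built backup for a single state with two actions (all numbers are invented): each action has a known immediate reward and a known optimal value of its successor state.

gamma = 0.9

# (immediate reward, V* of the successor state) for each action — invented numbers
action_outcomes = {
    "left":  (0.0, 1.0),
    "right": (0.5, 0.0),
}

# Bellman optimality backup: the best action value becomes V*(s)
V_s = max(r + gamma * v_next for r, v_next in action_outcomes.values())
print(V_s)  # 0.9: "left" wins because its discounted future value beats "right"'s immediate reward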
Python Example: Value Iteration
import numpy as np
# States: 0, 1, 2
# Actions: 0=Left, 1=Right
# P[state][action] = list of (probability, next_state, reward, done) tuples
P = {
    0: {0: [(1.0, 0, 0, False)], 1: [(1.0, 1, 0, False)]},
    1: {0: [(1.0, 0, 0, False)], 1: [(1.0, 2, 1, True)]},
    2: {0: [(1.0, 2, 0, True)], 1: [(1.0, 2, 0, True)]}
}
def value_iteration(P, gamma=0.9, theta=0.001):
    V = np.zeros(3)
    while True:
        delta = 0
        for s in range(3):
            v = V[s]
            # Bellman optimality backup: evaluate every action, keep the best
            Q = []
            for a in P[s]:
                Q.append(sum([p * (r + gamma * V[s_]) for p, s_, r, done in P[s][a]]))
            V[s] = max(Q)
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    return V
V = value_iteration(P)
print("Optimal Values:", V)
Explanation of the Example
- The states represent positions in a simple environment.
- The actions are Left and Right.
- The transition dictionary P defines how actions change the state.
- The algorithm updates the value of each state until convergence using the Bellman equation.
In short: the code has 3 states and two actions. On every sweep the agent works out the best value for each state, and it stops once the values no longer change (meaning it has reached the optimal solution).
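To watch this convergence happen, here is a small variation of the same loop (a sketch that reuses P and the numpy import from the example above) which prints the value estimates after every sweep:

def value_iteration_verbose(P, gamma=0.9, theta=0.001):
    V = np.zeros(3)
    sweep = 0
    while True:
        delta = 0
        for s in range(3):
            v = V[s]
            V[s] = max(sum(p * (r + gamma * V[s_]) for p, s_, r, done in P[s][a]) for a in P[s])
            delta = max(delta, abs(v - V[s]))
        sweep += 1
        print(f"sweep {sweep}: V = {V}, delta = {delta:.4f}")
        if delta < theta:
            break
    return V

value_iteration_verbose(P)
# Prints roughly: [0, 1, 0] -> [0.9, 1, 0] -> [0.9, 1, 0] (no more change, so the loop stops)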
Policy Iteration (Alternative Method)
Another way to solve MDPs is Policy Iteration, which alternates between evaluating a policy and improving it.
def policy_iteration(P, gamma=0.9, theta=0.001):
    policy = np.zeros(3, dtype=int)
    V = np.zeros(3)
    while True:
        # Policy Evaluation: compute V for the current (fixed) policy
        while True:
            delta = 0
            for s in range(3):
                v = V[s]
                a = policy[s]
                V[s] = sum([p * (r + gamma * V[s_]) for p, s_, r, done in P[s][a]])
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        # Policy Improvement: act greedily with respect to V
        stable = True
        for s in range(3):
            old_action = policy[s]
            Q = [sum([p * (r + gamma * V[s_]) for p, s_, r, done in P[s][a]]) for a in P[s]]
            policy[s] = np.argmax(Q)
            if old_action != policy[s]:
                stable = False
        if stable:
            break
    return policy, V
policy, V = policy_iteration(P)
print("Optimal Policy:", policy)
10 Exercises for Practice
1. Define what an MDP is in your own words.
2. List the main components of an MDP.
3. Write the Bellman Equation and explain each term.
4. Implement Value Iteration for a 4-state environment.
5. Change the discount factor γ and see how it affects the results.
6. Implement Policy Iteration in Python.
7. Compare the outputs of Value Iteration and Policy Iteration.
8. Create an MDP where rewards are negative and observe the effect.
9. Write a function that extracts the optimal policy from a value function.
10. Explain the difference between deterministic and stochastic policies.