How Transformers Work in Detail

Introduction

Transformers are deep learning models that changed the way machines understand text, images, and even sound. They power systems like ChatGPT, BERT, and DALL·E. In this tutorial, we explain how transformers work step by step using clear logic and simple math.

Moroccan Darija (translated): Transformer models changed the way machines understand text, images, and even sound. In this lesson we will explain, in detail and in a simple way, how they work.

Core Concepts Explained

Transformers are based on a mechanism called Self-Attention. This allows the model to understand the relationship between all words in a sentence at once.

Example sentence: “The cat sat on the mat.” The model looks at how each word relates to the others. For example, “cat” relates more to “sat” and “mat” than to “the”.
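Under the hood, self-attention is scaled dot-product attention: each word is projected into a query, a key, and a value vector, the query-key similarities are scaled and passed through a softmax, and the result weights a sum of the value vectors. Here is a minimal, illustrative PyTorch sketch (the tensor sizes are arbitrary and the learned projections are omitted):

# Minimal scaled dot-product attention sketch (illustrative sizes, projections omitted)
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (seq_len, seq_len) similarity matrix
    weights = F.softmax(scores, dim=-1)             # each row sums to 1: how strongly a word attends to the others
    return weights @ v                              # weighted sum of the value vectors

x = torch.randn(6, 8)   # 6 tokens ("The cat sat on the mat"), 8-dimensional vectors, random for illustration
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([6, 8])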

Main Components:

  • Input Embeddings: Convert words into numerical vectors.
  • Positional Encoding: Add order information to the embeddings (see the sketch below).
  • Self-Attention Layers: Compute relationships between words.
  • Feed Forward Networks: Process the attention outputs.
  • Layer Normalization: Stabilize training.
  • Residual Connections: Help keep information flowing.

Moroccan Darija (translated): The main components are:

  • Embeddings convert words into numbers.
  • Positional Encoding gives the words their order.
  • Self-Attention looks at the relationship between each word and the others.
  • Feed Forward processes the results.
  • Normalization helps with training.
  • Residual Connections preserve the information.
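
Positional encoding can be implemented in several ways; the original Transformer paper uses fixed sine and cosine waves of different frequencies so that each position gets a unique pattern. A small sketch of that sinusoidal variant (the sizes are arbitrary):

# Sinusoidal positional encoding sketch (the fixed variant from "Attention Is All You Need")
import torch

def sinusoidal_positional_encoding(seq_len, embed_dim):
    position = torch.arange(seq_len).unsqueeze(1)     # (seq_len, 1) token positions
    div_term = torch.exp(torch.arange(0, embed_dim, 2) * (-torch.log(torch.tensor(10000.0)) / embed_dim))
    pe = torch.zeros(seq_len, embed_dim)
    pe[:, 0::2] = torch.sin(position * div_term)      # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)      # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, embed_dim=8)
print(pe.shape)   # torch.Size([6, 8]); added to the embeddings before the first attention layer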

Syntax and Model Structure

A Transformer model is made of two parts: an Encoder and a Decoder.

  • Encoder: Reads the input sentence and creates hidden representations.
  • Decoder: Generates the output sequence based on the encoder outputs and the words generated so far.

The minimal PyTorch example below implements a single encoder-style block: embeddings, self-attention, and a feed-forward network, tied together with residual connections and layer normalization.

# Simple transformer-like encoder block in PyTorch
import torch
from torch import nn

class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # batch_first=True so inputs and outputs are (batch, seq_len, embed_dim)
        self.attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim)
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = self.embedding(x)                      # (batch, seq_len) -> (batch, seq_len, embed_dim)
        attn_output, _ = self.attention(x, x, x)   # self-attention: queries, keys, values all come from x
        x = self.norm1(x + attn_output)            # residual connection + layer normalization
        x = self.norm2(x + self.feed_forward(x))   # feed-forward block with its own residual + norm
        return x
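
A quick sanity check of the block, using small, arbitrary hyperparameters:

# Run a dummy batch through the block (all sizes are arbitrary)
model = SimpleTransformer(vocab_size=1000, embed_dim=32, num_heads=4, hidden_dim=64)
tokens = torch.randint(0, 1000, (2, 6))   # batch of 2 sentences, 6 token ids each
print(model(tokens).shape)                # torch.Size([2, 6, 32])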

Practical Examples

Example 1: Understanding Word Relationships

from transformers import AutoTokenizer, AutoModel

# Load a pretrained BERT encoder and its matching tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, num_tokens, 768): one contextual vector per token

This shows how BERT encodes each token into a vector that contains contextual information.
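If you also want to inspect the attention weights themselves, the transformers library can return them by passing output_attentions=True to the forward call; a short sketch reusing the model and inputs above:

# Ask BERT to also return its attention weights
outputs = model(**inputs, output_attentions=True)
attentions = outputs.attentions      # tuple with one tensor per layer
print(len(attentions))               # 12 layers for bert-base-uncased
print(attentions[0].shape)           # (batch_size, num_heads, seq_len, seq_len)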

Example 2: Machine Translation

from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

text = "I love learning AI"
inputs = tokenizer(text, return_tensors="pt")
translated = model.generate(**inputs)                             # the decoder generates French token ids
print(tokenizer.decode(translated[0], skip_special_tokens=True))  # turn the ids back into text

Darija Explanation (translated): This example shows how a Transformer performs machine translation, for example from English to French.
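
The same tokenizer and model can also translate several sentences at once by padding them into a single batch; a small sketch reusing the objects above (the example sentences are arbitrary):

# Translate a small batch of sentences in one call
texts = ["I love learning AI", "Transformers are powerful models"]
batch = tokenizer(texts, return_tensors="pt", padding=True)
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))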

Example 3: Text Generation

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("Artificial Intelligence is", return_tensors="pt")
output = model.generate(input_ids, max_length=20)   # greedy decoding by default; max_length includes the prompt tokens
print(tokenizer.decode(output[0], skip_special_tokens=True))

This example shows how GPT models, which are decoder-only, generate coherent text by repeatedly predicting the next token.
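
generate() also exposes decoding controls; instead of greedy decoding you can sample from the model's distribution. A sketch with commonly used arguments (the specific values are arbitrary choices):

# Sampled generation instead of greedy decoding (argument values are arbitrary)
output = model.generate(
    input_ids,
    max_length=30,
    do_sample=True,                        # sample from the distribution instead of always taking the top token
    top_k=50,                              # restrict sampling to the 50 most likely tokens
    temperature=0.9,                       # below 1.0 sharpens the distribution, above 1.0 flattens it
    pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no pad token; reusing EOS silences a warning
)
print(tokenizer.decode(output[0], skip_special_tokens=True))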

Explanation of Each Example

  • Example 1: Shows attention in action — every word depends on context.
  • Example 2: Uses encoder-decoder structure for translation.
  • Example 3: Uses a decoder-only architecture to predict the next word, as shown in the sketch below.
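
To make "predict the next word" concrete, you can take the logits at the last position and turn them into a probability distribution over the vocabulary; a sketch reusing the GPT-2 model and input_ids from Example 3:

# Next-token probabilities for "Artificial Intelligence is"
import torch

with torch.no_grad():
    logits = model(input_ids).logits          # (batch_size, seq_len, vocab_size)
next_token_logits = logits[0, -1]             # scores for the position right after the prompt
probs = torch.softmax(next_token_logits, dim=-1)
top = torch.topk(probs, k=5)
print([tokenizer.decode([int(i)]) for i in top.indices])   # the 5 most likely next tokens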

10 Exercises for Practice

  1. Explain what self-attention means in your own words.
  2. Write a Python function to normalize attention scores using softmax.
  3. Modify the SimpleTransformer class to add dropout.
  4. Compare encoder-only (BERT) and decoder-only (GPT) architectures.
  5. Try using a different tokenizer and see how tokenization changes.
  6. Visualize attention scores for a simple sentence using any library.
  7. Train a small transformer on a custom text dataset.
  8. Explain why positional encoding is needed.
  9. Implement a small feed-forward block in PyTorch.
  10. Experiment with different numbers of heads in MultiheadAttention.

Conclusion

Transformers changed AI by allowing parallel training and better context understanding. They are now the foundation for models in NLP, vision, and multimodal AI.

Darija Summary (translated): Transformers revolutionized the world of artificial intelligence and made it possible for models to understand text and images in a powerful and fast way.
