Transformers
Transformers are deep learning models built for sequence data. They use attention to model relationships between tokens, and they power modern systems in language, vision, and multimodal AI.
Why Transformers Matter
- They process sequences without recurrence.
- They support parallel computation.
- They capture long-range dependencies.
Core Architecture
1. Input Embeddings
Each token is mapped to a vector that encodes its meaning.
2. Positional Encoding
Transformers inject position information into the embeddings because the architecture has no recurrence to convey token order (a positional-encoding example is sketched after this list).
3. Encoder Blocks
Each block combines self-attention with a feedforward layer. Stacked encoder blocks build rich representations of the input.
4. Decoder Blocks
Decoders generate the output sequence. They use masked self-attention so that a position cannot attend to future tokens (see the mask sketch after this list).
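To make steps 1 and 2 concrete, here is a minimal sketch that looks up toy token embeddings and adds the sinusoidal positional encoding from the original Transformer paper; the vocabulary size, model dimension, and token ids are made-up values for illustration.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Classic sine/cosine positional encoding."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                      # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                        # (seq_len, d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])             # even dims get sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])             # odd dims get cosine
    return encoding

# Toy example: vocabulary of 100 tokens, model dimension 16.
vocab_size, d_model = 100, 16
embedding_table = np.random.randn(vocab_size, d_model) * 0.02  # learned in practice

token_ids = np.array([5, 42, 7, 99])                        # made-up input sequence
x = embedding_table[token_ids]                              # token embeddings (4, 16)
x = x + sinusoidal_positional_encoding(len(token_ids), d_model)  # add position info
print(x.shape)  # (4, 16)
```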
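And to show what the decoder's masking looks like, the sketch below builds a causal (look-ahead) mask; the sequence length of 4 is arbitrary.

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: True where attention is allowed."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
# During attention, disallowed positions get a score of -inf before the softmax,
# so the decoder cannot peek at future tokens.
```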
Attention Mechanism
Attention determines how much each token should focus on every other token. It relies on three learned projections.
- Query. Describes what the token needs.
- Key. Describes what each token offers.
- Value. Carries the content that is mixed into the output.
The model computes attention scores from queries and keys, normalizes them with a softmax, and uses the resulting weights to mix the values.
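A minimal sketch of this score-and-mix step (scaled dot-product attention) with random toy matrices; all shapes and values are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # (seq_len, seq_len) attention scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # block disallowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights                 # weighted mixture of the values

seq_len, d_k = 4, 8
Q = np.random.randn(seq_len, d_k)   # what each token is looking for
K = np.random.randn(seq_len, d_k)   # what each token offers
V = np.random.randn(seq_len, d_k)   # the content that gets mixed
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (4, 8) (4, 4)
```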
Multi-Head Attention
Attention is split into several heads. Each head learns a different kind of relationship, and the head outputs are concatenated back into a single vector.
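A rough sketch of the head-splitting idea, assuming random placeholder matrices in place of learned projection weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """Split d_model into heads, attend per head, concatenate, project."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    def split(W):
        # Project, then reshape to (num_heads, seq_len, d_head).
        return (x @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(W_q), split(W_k), split(W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # per-head scores
    heads = softmax(scores) @ V                           # per-head outputs
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # merge heads
    return concat @ W_o

# Toy dimensions: 4 tokens, d_model = 16, 4 heads of size 4 each.
seq_len, d_model, num_heads = 4, 16, 4
x = np.random.randn(seq_len, d_model)
W_q, W_k, W_v, W_o = (np.random.randn(d_model, d_model) * 0.1 for _ in range(4))
print(multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads).shape)  # (4, 16)
```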
Feedforward Layers
Each block contains a position-wise feedforward network that applies a nonlinear transformation to each token after attention.
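A minimal sketch of the position-wise feedforward block: two linear maps with a ReLU in between, applied to every token independently. The 4x hidden size follows a common convention but is only an assumption here.

```python
import numpy as np

def feedforward(x, W1, b1, W2, b2):
    """Position-wise FFN: ReLU(x W1 + b1) W2 + b2, applied to each token separately."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 16, 64                       # hidden layer is often ~4x d_model
x = np.random.randn(4, d_model)              # 4 token representations after attention
W1, b1 = np.random.randn(d_model, d_ff) * 0.1, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.1, np.zeros(d_model)
print(feedforward(x, W1, b1, W2, b2).shape)  # (4, 16)
```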
Popular Transformer Models
BERT
Works for understanding text. Uses encoder blocks only.
GPT
Works for text generation. Uses decoder blocks only.
T5
Works with text-to-text tasks. Uses both encoder and decoder blocks.
Vision Transformers
Split images into patches and process them as sequences.
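To illustrate how an image becomes a sequence, this sketch cuts one image into non-overlapping patches and flattens each into a vector; the 32x32 image and 8x8 patch sizes are arbitrary choices.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    rows, cols = H // patch_size, W // patch_size
    patches = image[:rows * patch_size, :cols * patch_size]
    patches = patches.reshape(rows, patch_size, cols, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)           # (rows, cols, p, p, C)
    return patches.reshape(rows * cols, patch_size * patch_size * C)

image = np.random.rand(32, 32, 3)            # toy RGB image
patches = image_to_patches(image, patch_size=8)
print(patches.shape)                         # (16, 192): 16 patches, each a 192-dim vector
# Each patch is then linearly projected and fed to the transformer like a token.
```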
Training Transformers
- Use large datasets.
- Train with self supervised tasks.
- Use Adam or AdamW optimizers.
- Use attention masks to control which positions can attend to each other.
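As one concrete example of an attention mask, the sketch below builds a padding mask for a batch of variable-length sequences; the pad id of 0 and the batch contents are made up.

```python
import numpy as np

PAD_ID = 0  # assumed padding token id for this example

# A toy batch of two sequences, padded to length 5.
batch = np.array([
    [12, 7, 45, PAD_ID, PAD_ID],
    [3, 99, 21, 8, 4],
])

# True where a real token sits, False at padding positions.
padding_mask = batch != PAD_ID
print(padding_mask.astype(int))
# [[1 1 1 0 0]
#  [1 1 1 1 1]]
# During attention, scores toward masked (padding) positions are set to -inf
# so padded slots contribute nothing to the softmax.
```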
Strengths of Transformers
- Model long-range dependencies well.
- Fast training with parallelism.
- Flexible for many data types.
Limitations
- High memory use, since attention scales quadratically with sequence length.
- Heavy compute needs.
- Sensitive training process.
Transformers in Brief
Transformers are models that work with attention. They understand the relations between tokens without loops, which has made them powerful in NLP and even in vision.
How They Work
- Embeddings to turn words into vectors.
- Positional encoding to capture order.
- Attention so each token can look at the other tokens.
- Feedforward layers to add further transformation.
Well-Known Models
- BERT for understanding.
- GPT for generation.
- T5 for text-to-text tasks.
- Vision Transformers for images.
Conclusion
Transformers drive progress in modern AI. They use attention and parallel processing to learn strong patterns. They support language, vision, and multimodal tasks with high performance.