Attention Is All You Need - Transformer Architecture
Research Papers
2025-11-12
Attention Is All You Need
Authors: Vaswani et al. (2017)
Core Idea
The Transformer architecture revolutionized sequence-to-sequence modeling by relying entirely on attention mechanisms, dispensing with recurrence and convolutions.
Key Components
Self-Attention Mechanism
The scaled dot-product attention is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Where:
- $Q$ (queries), $K$ (keys), and $V$ (values) are linear projections of the input
- $d_k$ is the dimension of the keys
- The scaling factor $\frac{1}{\sqrt{d_k}}$ prevents the dot products from growing too large
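A minimal NumPy sketch of this formula; the function name, toy shapes, and random inputs below are illustrative choices, not code from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (seq_q, d_k), K: (seq_k, d_k), V: (seq_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (seq_q, seq_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V, weights                            # weighted sum of values, attention map

# Tiny usage example with random projections standing in for learned ones
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)                         # (5, 8) (5, 5)
```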
Multi-Head Attention
Instead of performing a single attention function, multi-head attention projects the queries, keys, and values $h$ times with different learned projections:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O$$

Where each head is:

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
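As a rough illustration, the sketch below wires $h$ independent heads together; the projection matrices are random stand-ins for learned weights, and the shapes are arbitrary:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads=4, d_model=32, seed=0):
    """X: (seq_len, d_model); returns (seq_len, d_model)."""
    assert d_model % num_heads == 0
    d_k = d_model // num_heads
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(num_heads):
        # Each head gets its own projections (random placeholders for learned weights)
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_k))
        heads.append(weights @ V)                           # (seq_len, d_k)
    W_o = rng.normal(size=(d_model, d_model))
    return np.concatenate(heads, axis=-1) @ W_o             # concat heads, then project

X = np.random.default_rng(1).normal(size=(6, 32))
print(multi_head_attention(X).shape)                        # (6, 32)
```

Each head attends in its own representation subspace, and the final projection $W^O$ mixes the concatenated head outputs back into the model dimension.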
Positional Encoding
Since the model has no recurrence, positional information is injected using sinusoidal positional encodings:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
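A small NumPy sketch of these encodings; the max_len and d_model values are arbitrary choices for illustration:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Returns a (max_len, d_model) matrix of sinusoidal encodings."""
    pos = np.arange(max_len)[:, None]                       # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                    # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)        # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                             # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                             # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)                                             # (50, 16)
```

The encoding matrix is simply added to the token embeddings before the first layer.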
Architecture Benefits
- Parallelization: Unlike RNNs, all positions can be processed simultaneously
- Long-range dependencies: Direct connections between all positions
- Interpretability: Attention weights show which positions the model focuses on (see the sketch after this list)
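As a toy illustration of the interpretability point, the snippet below builds a single attention-weight matrix from random queries and keys and prints how one query position distributes its attention; the tokens and weights are made up, not from a trained model:

```python
import numpy as np

tokens = ["the", "cat", "sat", "on", "the", "mat"]
rng = np.random.default_rng(2)
d_k = 8
Q = rng.normal(size=(len(tokens), d_k))
K = rng.normal(size=(len(tokens), d_k))

scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)              # each row sums to 1

# Row i shows how much position i attends to every position in the sequence
query_idx = 2                                               # "sat"
for tok, w in zip(tokens, weights[query_idx]):
    print(f"{tok:>4s}: {w:.2f}")
```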
Impact
This architecture became the foundation for:
- BERT (2018)
- GPT series (2018-present)
- T5 (2019)
- And virtually all modern LLMs
Key Takeaways
- Attention mechanisms can replace recurrence entirely
- Multi-head attention provides different representation subspaces
- Position encodings are crucial for maintaining sequence order
- The architecture scales exceptionally well