Attention Is All You Need - Transformer Architecture

Research Papers
2025-11-12

Paper Summary

Authors: Vaswani et al. (2017)

Core Idea

The Transformer architecture revolutionized sequence-to-sequence modeling by relying entirely on attention mechanisms, dispensing with recurrence and convolutions.

Key Components

Self-Attention Mechanism

The scaled dot-product attention is computed as:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Where:

  • Q (queries), K (keys), and V (values) are linear projections of the input
  • d_k is the dimension of the keys
  • The scaling factor \frac{1}{\sqrt{d_k}} keeps the dot products from growing so large that the softmax saturates into regions with extremely small gradients (a NumPy sketch of the full computation follows below)
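
To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function name, shapes, and random inputs are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of scaled dot-product attention (illustrative; not the
# authors' reference implementation). Q, K, V are assumed to have shape
# (seq_len, d_k) after their linear projections.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights summing to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors
    return weights @ V

# Toy usage: 4 tokens, key/value dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)
```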

Multi-Head Attention

Instead of performing a single attention function, multi-head attention projects the queries, keys, and values h times with different learned projections:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O

Where each head is:

\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
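
A rough sketch of how the heads fit together, reusing the scaled_dot_product_attention function from the sketch above. The random projection matrices stand in for the learned weights W_i^Q, W_i^K, W_i^V, and W^O, and all shapes are illustrative assumptions (d_model split evenly across h heads).

```python
# Sketch of multi-head self-attention over a single input X of shape
# (seq_len, d_model), with d_k = d_model // h per head.
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    heads = []
    for i in range(h):
        # Per-head linear projections of the same input
        Q = X @ W_Q[i]   # (seq_len, d_k)
        K = X @ W_K[i]
        V = X @ W_V[i]
        heads.append(scaled_dot_product_attention(Q, K, V))
    # Concatenate head outputs and mix them with the output projection W^O
    return np.concatenate(heads, axis=-1) @ W_O

# Toy usage: 4 tokens, d_model = 16, h = 4 heads (d_k = 4)
rng = np.random.default_rng(1)
h, d_model, d_k = 4, 16, 4
X = rng.normal(size=(4, d_model))
W_Q = rng.normal(size=(h, d_model, d_k))
W_K = rng.normal(size=(h, d_model, d_k))
W_V = rng.normal(size=(h, d_model, d_k))
W_O = rng.normal(size=(d_model, d_model))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)  # shape (4, 16)
```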

Positional Encoding

Since the model has no recurrence, positional information is injected using:

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)

PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)
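
A small sketch of how this sinusoidal table can be precomputed, assuming an even embedding dimension d and a maximum sequence length chosen here for illustration; each row is added to the corresponding token embedding.

```python
# Build the sinusoidal positional encoding table of shape (max_len, d).
# Even columns hold sines, odd columns hold cosines of the same angles.
import numpy as np

def positional_encoding(max_len, d):
    pos = np.arange(max_len)[:, None]          # (max_len, 1) positions
    i = np.arange(d // 2)[None, :]             # (1, d/2) dimension pairs
    angles = pos / np.power(10000, 2 * i / d)  # pos / 10000^(2i/d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)               # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)               # PE(pos, 2i+1)
    return pe

pe = positional_encoding(max_len=50, d=16)     # added to token embeddings
```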

Architecture Benefits

  1. Parallelization: Unlike RNNs, all positions can be processed simultaneously
  2. Long-range dependencies: Direct connections between all positions
  3. Interpretability: Attention weights show what the model focuses on

Impact

This architecture became the foundation for:

  • BERT (2018)
  • GPT series (2018-present)
  • T5 (2019)
  • And virtually all modern LLMs

Key Takeaways

  • Attention mechanisms can replace recurrence entirely
  • Multi-head attention provides different representation subspaces
  • Position encodings are crucial for maintaining sequence order
  • The architecture scales exceptionally well