What Are Transformers?
In the landscape of deep learning, the Transformer represents a fundamental architectural paradigm shift. Introduced in 2017 by Vaswani et al. in "Attention Is All You Need," Transformers replaced sequential processing with a purely attention-based mechanism that processes entire sequences in parallel.
Definition: Transformer
A Transformer is a neural network architecture that maps sequences to sequences using self-attention as its core computational primitive. Unlike RNNs, it has no inherent notion of sequential order—position information must be explicitly encoded.
For mathematicians, the key insight is this: whereas an RNN processes a sequence \((x_1, x_2, \ldots, x_n)\) by maintaining a hidden state \(h_t = f(h_{t-1}, x_t)\) that evolves through time, a Transformer computes representations for all positions simultaneously via learned weighted combinations.
The Core Innovation
The mathematical elegance of Transformers lies in replacing recurrence with attention—a mechanism that computes, for each element in a sequence, a weighted average of all other elements where the weights themselves are computed from the data.
Key Mathematical Insight
Self-attention can be viewed as a soft, differentiable version of database lookup. Given a query, it retrieves values by computing similarity to keys. This is fundamentally a weighted sum where weights come from a softmax over dot products—entirely composed of differentiable operations.
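To make the analogy concrete, here is a minimal Python sketch (with small, made-up keys, values, and query vectors) contrasting a hard dictionary lookup with the soft, differentiable lookup that attention performs:

```python
import numpy as np

# Hard lookup: the query must match a key exactly and retrieves a single value.
table = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
hard_result = table["cat"]

# Soft lookup: compare the query to every key by dot product, softmax the
# similarities, and return the weighted average of the values. Every step
# is differentiable, so gradients can flow through the "lookup".
keys    = np.array([[1.0, 0.2], [0.1, 1.0]])   # one row per key
values  = np.array([[1.0, 0.0], [0.0, 1.0]])   # one row per value
query   = np.array([0.9, 0.3])                 # resembles the first key

scores  = keys @ query                          # dot-product similarities
weights = np.exp(scores) / np.exp(scores).sum() # softmax
soft_result = weights @ values                  # weighted average of values

print(weights)       # ~[0.64 0.36]: mostly, but not only, the first value
print(soft_result)
```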
Prerequisites
- Feedforward networks: compositions of affine transformations with nonlinearities
- Gradient descent: optimizing parameters via backpropagation
- Embeddings: mapping discrete tokens to continuous vector spaces
- The softmax function: \(\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}\) (a short implementation follows this list)
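As a quick refresher, the softmax can be implemented in a few lines. This is a minimal NumPy sketch; subtracting the maximum is a standard numerical-stability trick, not part of the formula itself:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: exp(z_i) / sum_j exp(z_j)."""
    z = z - np.max(z)          # shifting by a constant leaves the result unchanged
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))  # ~[0.09, 0.24, 0.67]
```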
How Transformers Changed AI
To appreciate the significance of Transformers, we must understand what came before.
The Sequential Bottleneck
Before Transformers, sequence modeling was dominated by Recurrent Neural Networks (RNNs). These process sequences step by step, updating a hidden state via the recurrence \(h_t = f(h_{t-1}, x_t)\).
This sequential dependency, illustrated in the code sketch after this list, creates two fundamental problems:
- No parallelization: computing \(h_t\) requires \(h_{t-1}\), so the \(n\) steps must run one after another
- Long-range dependencies: information from early tokens must flow through many steps, and gradients tend to vanish along the way
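The bottleneck is easiest to see in code. Below is a minimal sketch of a vanilla RNN cell (made-up dimensions, with tanh standing in for the generic \(f\)); the loop body at step \(t\) cannot start until step \(t-1\) has finished:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4                           # sequence length, hidden/input size
X = rng.normal(size=(n, d))           # input sequence x_1, ..., x_n

W_h = rng.normal(size=(d, d)) * 0.1   # hidden-to-hidden weights
W_x = rng.normal(size=(d, d)) * 0.1   # input-to-hidden weights

h = np.zeros(d)
for t in range(n):                    # inherently sequential: h_t depends on h_{t-1}
    h = np.tanh(h @ W_h + X[t] @ W_x)
print(h)                              # final hidden state h_n
```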
The shift away from recurrence happened quickly:
- 2014: Sequence-to-Sequence with Attention. Bahdanau et al. introduce attention for machine translation, allowing decoders to "look back" at encoder states.
- 2017: Attention Is All You Need. Vaswani et al. eliminate recurrence entirely, reaching state-of-the-art translation quality at a small fraction of the training cost of prior models.
- 2018: BERT & GPT. Transformers pretrained on massive text corpora revolutionize NLP.
- 2020: Vision Transformers. Dosovitskiy et al. apply Transformers to images, challenging CNN dominance.
- 2022+: Large Language Models. GPT-4, Claude, and Gemini scale to hundreds of billions of parameters and show emergent capabilities.
The Parallelization Advantage
The core attention operation can be expressed entirely in terms of matrix multiplications:

\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]

This single expression computes attention for all positions at once and is fully parallelizable on GPU hardware.
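A minimal NumPy sketch of that expression, for a single sequence with randomly initialized matrices used purely for illustration:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d_v) weighted averages

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 8, 8
Q, K, V = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k)), rng.normal(size=(n, d_v))
print(attention(Q, K, V).shape)                      # (5, 8)
```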
Computational Complexity
- RNN: \(O(n)\) sequential steps, each costing \(O(d^2)\); the steps cannot be parallelized across time
- Transformer: \(O(n^2 d)\) total operations, with all positions computed in parallel
The Transformer Architecture
The original Transformer follows an encoder-decoder structure. Modern variants often use only the encoder stack (BERT) or only the decoder stack (GPT).
Component 1: Token Embeddings
Each token is mapped to a dense vector via a learned embedding matrix \(E \in \mathbb{R}^{|V| \times d}\): a token with vocabulary index \(t\) is represented by the row \(x = E_t \in \mathbb{R}^d\).
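As a sketch (the vocabulary size and dimensions are illustrative, and the matrix is random rather than learned), embedding lookup is simply row indexing:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10_000, 16
E = rng.normal(scale=0.02, size=(vocab_size, d))  # learned in practice; random here

token_ids = np.array([42, 7, 999])   # a 3-token input
X = E[token_ids]                     # (3, d): one embedding row per token
print(X.shape)
```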
Component 2: Positional Encoding
Attention has no inherent notion of position, so we inject it with sinusoidal encodings that are added to the token embeddings:

\[PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)\]
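A minimal sketch of these encodings, assuming an even model dimension \(d\):

```python
import numpy as np

def sinusoidal_encoding(n_positions, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(n_positions)[:, None]         # (n, 1)
    i = np.arange(d // 2)[None, :]                       # (1, d/2)
    angles = positions / np.power(10000.0, 2 * i / d)    # (n, d/2)
    pe = np.zeros((n_positions, d))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

print(sinusoidal_encoding(4, 8).round(2))   # one row per position
```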
Component 3: Layer Normalization
Each sub-layer's output is normalized across the feature dimension, \(\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sigma} + \beta\), where \(\mu\) and \(\sigma\) are the mean and standard deviation of the components of \(x\) and \(\gamma, \beta\) are learned parameters. This stabilizes the training of deep stacks.
Component 4: Residual Connections
Every sub-layer is wrapped in a skip connection, \(x \mapsto x + \text{Sublayer}(x)\), which preserves a direct gradient path through the network.
Component 5: Feed-Forward Network
A position-wise two-layer network, \(\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2\), is applied identically and independently at every position.
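Putting components 3 through 5 together with attention gives one encoder block. The sketch below is a minimal single-head version (all weights are random placeholders, the learned LayerNorm scale and shift are omitted, and the post-norm ordering of the original paper is used):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row (token) across the feature dimension.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def ffn(X, W1, b1, W2, b2):
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2     # ReLU two-layer MLP

def encoder_block(X, p):
    # Post-norm ordering, as in the original paper: LayerNorm(x + Sublayer(x)).
    X = layer_norm(X + attention(X, p["W_q"], p["W_k"], p["W_v"]))
    X = layer_norm(X + ffn(X, p["W1"], p["b1"], p["W2"], p["b2"]))
    return X

rng = np.random.default_rng(0)
n, d, d_ff = 4, 8, 32
p = {k: rng.normal(scale=0.1, size=s) for k, s in {
    "W_q": (d, d), "W_k": (d, d), "W_v": (d, d),
    "W1": (d, d_ff), "b1": (d_ff,), "W2": (d_ff, d), "b2": (d,)}.items()}
print(encoder_block(rng.normal(size=(n, d)), p).shape)   # (4, 8)
```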
The Self-Attention Mechanism
Self-attention is the mathematical heart of the Transformer.
Definition: Scaled Dot-Product Attention
Given queries \(Q \in \mathbb{R}^{n \times d_k}\), keys \(K \in \mathbb{R}^{n \times d_k}\), and values \(V \in \mathbb{R}^{n \times d_v}\):

\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \in \mathbb{R}^{n \times d_v}\]
Step 1: Q, K, V Projections
For self-attention, all three derive from the same input \(X \in \mathbb{R}^{n \times d}\) via learned projection matrices: \(Q = XW^Q\), \(K = XW^K\), \(V = XW^V\), with \(W^Q, W^K \in \mathbb{R}^{d \times d_k}\) and \(W^V \in \mathbb{R}^{d \times d_v}\).
Step 2: Computing Attention Scores
Pairwise similarities are computed by dot products and scaled: \(S = \frac{QK^\top}{\sqrt{d_k}} \in \mathbb{R}^{n \times n}\), where \(S_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}\).
Proposition: Scaling Factor
If \(q, k \in \mathbb{R}^{d_k}\) have i.i.d. components with mean 0 and variance 1, then \(\text{Var}(q \cdot k) = \sum_{i=1}^{d_k} \text{Var}(q_i k_i) = d_k\), since each product \(q_i k_i\) has mean 0 and variance \(E[q_i^2]\,E[k_i^2] = 1\). Dividing the scores by \(\sqrt{d_k}\) normalizes their variance to 1, which keeps the softmax away from its saturated, vanishing-gradient regime.
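A quick numerical check of the proposition (the sample count and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (4, 64, 256):
    q = rng.normal(size=(100_000, d_k))       # i.i.d. components, mean 0, variance 1
    k = rng.normal(size=(100_000, d_k))
    dots = (q * k).sum(axis=1)                # q . k for each sample
    print(d_k, dots.var().round(1), (dots / np.sqrt(d_k)).var().round(2))
    # variance of q.k grows like d_k; after scaling it stays near 1
```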
Step 3: Softmax Normalization
Each row of \(S\) is normalized with the softmax, giving an attention matrix \(A = \text{softmax}(S)\) whose rows are probability distributions: \(A_{ij} = \frac{e^{S_{ij}}}{\sum_{l} e^{S_{il}}}\).
Step 4: Weighted Aggregation
The output is \(AV \in \mathbb{R}^{n \times d_v}\): row \(i\) is the weighted average \(\sum_j A_{ij} v_j\) of the value vectors.
Multi-Head Attention
Rather than a single attention function, the Transformer runs \(h\) attention "heads" in parallel on lower-dimensional projections and concatenates the results: \(\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O\), where \(\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\). Each head can specialize in a different kind of relationship between positions.
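A compact sketch of multi-head self-attention (single sequence, no batching; the head count, dimensions, and random weights are illustrative):

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, heads, d_k, rng):
    n, d = X.shape
    outputs = []
    for _ in range(heads):
        # Per-head projections to a smaller dimension d_k (learned in practice).
        W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d, d_k)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        A = softmax_rows(Q @ K.T / np.sqrt(d_k))
        outputs.append(A @ V)                           # (n, d_k) per head
    W_o = rng.normal(scale=0.1, size=(heads * d_k, d))
    return np.concatenate(outputs, axis=-1) @ W_o       # concat heads, project to d

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))                            # 6 tokens, model dimension 16
print(multi_head_self_attention(X, heads=4, d_k=4, rng=rng).shape)   # (6, 16)
```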
[Interactive visualization omitted: attention patterns showing each word's attention distribution.]
Causal (Masked) Attention
In decoder-style models, position \(i\) must not attend to future positions \(j > i\). This is enforced by adding a mask to the scores before the softmax:

\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V\]

where \(M_{ij} = 0\) if \(j \leq i\) and \(M_{ij} = -\infty\) otherwise, so masked positions receive zero attention weight.
Worked Example
Let's trace through a complete self-attention computation.
Setup: \(n = 3\) tokens, \(d = 4\), \(d_k = d_v = 2\)
Step 1: Define Projection Matrices
Step 2: Compute Q, K, V
Step 3: Compute Attention Scores
Step 4: Apply Softmax
Interpretation: Token A attends mostly to B (0.78), B to A (0.78), C equally to all.
Step 5: Compute Output
What Did We Compute?
Each output row is a weighted combination of value vectors, where weights depend on query-key similarity. Token A's output is dominated by B's value because A's query aligned best with B's key.
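The original example's matrices are not reproduced here, so the sketch below runs the same five steps on small illustrative matrices of its own; its numbers will differ from the interpretation above, but the computation is identical.

```python
import numpy as np

# Setup: n = 3 tokens, d = 4, d_k = d_v = 2 (all matrices here are illustrative).
X = np.array([[1.0, 0.0, 1.0, 0.0],    # token A
              [0.0, 1.0, 0.0, 1.0],    # token B
              [1.0, 1.0, 0.0, 0.0]])   # token C

rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(scale=0.5, size=(4, 2)) for _ in range(3))  # Step 1

Q, K, V = X @ W_q, X @ W_k, X @ W_v           # Step 2: project to Q, K, V

S = Q @ K.T / np.sqrt(2)                      # Step 3: scaled dot-product scores

A = np.exp(S - S.max(axis=-1, keepdims=True)) # Step 4: row-wise softmax
A /= A.sum(axis=-1, keepdims=True)

out = A @ V                                   # Step 5: weighted average of values
print(A.round(2))
print(out.round(2))
```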
Modern Applications
Transformers have become dominant across virtually all domains of ML.
Large Language Models
GPT-4, Claude, Gemini—decoder-only Transformers trained on trillions of tokens with emergent reasoning.
Computer Vision
Vision Transformers treat images as sequences of patches, matching or exceeding CNNs.
Image Generation
DALL-E and Stable Diffusion rely on Transformer components, such as Transformer text encoders and attention layers in the generative backbone, for text-to-image generation.
Protein Structure
AlphaFold 2 uses attention to predict 3D structures from sequences.
Audio & Speech
Whisper, AudioLM, and music models process sequential audio data.
Code Generation
Codex, Copilot generate, explain, and debug code from natural language.
Scaling Laws
Transformer performance follows predictable power laws: as Kaplan et al. (2020) showed empirically, test loss decreases roughly as a power law in the number of parameters \(N\), the dataset size \(D\), and the training compute \(C\), e.g. \(L(N) \approx (N_c/N)^{\alpha_N}\) for fitted constants \(N_c, \alpha_N\), holding over several orders of magnitude.
Open Questions
- Expressivity: What function classes can Transformers represent efficiently?
- Generalization: Why do overparameterized models generalize well?
- In-context learning: How do models learn from prompt examples without weight updates?
- Emergence: Why do capabilities appear suddenly at certain scales?
Summary
Key Takeaways
- Transformers replace recurrence with attention—enabling parallel processing
- Self-attention is a learnable weighted average—softmax-normalized dot products
- The architecture is modular—attention, FFN, residual, normalization
- Scaling is predictable—power laws in parameters, data, compute
- Applications are universal—language, vision, biology, and beyond
Further Reading
- Vaswani et al., "Attention Is All You Need" (2017)
- Yun et al., "Are Transformers universal approximators?" (2019)
- Kaplan et al., "Scaling Laws for Neural Language Models" (2020)
- Elhage et al., "A Mathematical Framework for Transformer Circuits" (2021)
Discussion Questions
- What structures might allow sub-quadratic attention while preserving expressivity?
- What are the implications of positional encodings for position-independent algorithms?
- Is there a connection between multi-head attention and tensor decomposition?
- How does soft attention compare to hard attention in optimization?