A Mathematical Introduction

The Transformer
Architecture

Exploring the mathematical foundations behind the architecture that revolutionized artificial intelligence.

Nir Naim
Tel Aviv University
Queueing Theory Seminar

What Are Transformers?

In the landscape of deep learning, the Transformer represents a fundamental architectural paradigm shift. Introduced in 2017 by Vaswani et al. in "Attention Is All You Need," Transformers replaced sequential processing with a purely attention-based mechanism that processes entire sequences in parallel.

Definition: Transformer

A Transformer is a neural network architecture that maps sequences to sequences using self-attention as its core computational primitive. Unlike RNNs, it has no inherent notion of sequential order—position information must be explicitly encoded.

For mathematicians, the key insight is this: whereas an RNN processes a sequence \((x_1, x_2, \ldots, x_n)\) by maintaining a hidden state \(h_t = f(h_{t-1}, x_t)\) that evolves through time, a Transformer computes representations for all positions simultaneously via learned weighted combinations.

The Core Innovation

The mathematical elegance of Transformers lies in replacing recurrence with attention—a mechanism that computes, for each element of a sequence, a weighted average over all elements (including itself), where the weights themselves are computed from the data.

Key Mathematical Insight

Self-attention can be viewed as a soft, differentiable version of database lookup. Given a query, it retrieves values by computing similarity to keys. This is fundamentally a weighted sum where weights come from a softmax over dot products—entirely composed of differentiable operations.
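
To make the lookup analogy concrete, here is a minimal NumPy sketch; the keys, values, and query are toy numbers chosen purely for illustration.

```python
import numpy as np

def soft_lookup(query, keys, values):
    """Soft database lookup: return a weighted mix of the value rows,
    weighted by softmax similarity between the query and each key."""
    scores = keys @ query                      # dot-product similarity to each key
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                    # weighted average of the rows of `values`

# A hard lookup with this query would return values[1]; the soft version
# returns a value close to it, blended with small contributions from the rest.
keys   = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, -0.7]])
values = np.array([[10.0], [20.0], [30.0]])
query  = np.array([0.0, 3.0])                  # most similar to the second key
print(soft_lookup(query, keys, values))        # ≈ [19.6]
```

Because every operation here (dot product, exponential, normalization, weighted sum) is differentiable, gradients flow through the lookup during training.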

Prerequisites

How Transformers Changed AI

To appreciate the significance of Transformers, we must understand what came before.

The Sequential Bottleneck

Before Transformers, sequence modeling was dominated by Recurrent Neural Networks (RNNs). These process sequences step-by-step:

\[h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t + b)\]

This sequential dependency creates two fundamental problems:

  1. No parallelization: computing \(h_t\) requires \(h_{t-1}\), so the \(n\) time steps must run one after another (see the sketch after this list)
  2. Long-range dependencies: information must flow through many steps, leading to vanishing or exploding gradients
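
The sketch below makes the bottleneck concrete: a direct NumPy implementation of the recurrence above, with randomly initialized weights standing in for learned parameters.

```python
import numpy as np

def rnn_forward(X, W_hh, W_xh, b):
    """Vanilla RNN: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b).
    The loop is inherently sequential: step t cannot begin until step t-1 is done."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in X:                               # one iteration per time step
        h = np.tanh(W_hh @ h + W_xh @ x_t + b)
        states.append(h)
    return np.stack(states)

# Toy sizes: sequence length 5, input dim 3, hidden dim 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
H = rnn_forward(X, rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4))
print(H.shape)  # (5, 4)
```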

2014

Sequence-to-Sequence with Attention

Bahdanau et al. introduce attention for machine translation, allowing decoders to "look back" at encoder states.

2017

Attention Is All You Need

Vaswani et al. eliminate recurrence entirely, reaching state-of-the-art translation quality at a small fraction of the training cost of previous best models.

2018

BERT & GPT

Transformers pretrained on massive text corpora revolutionize NLP.

2020

Vision Transformers

Dosovitskiy et al. apply Transformers to images, challenging CNN dominance.

2022+

Large Language Models

GPT-4, Claude, Gemini—hundreds of billions of parameters, emergent capabilities.

The Parallelization Advantage

The core attention operation can be expressed as matrix multiplications:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]

This single expression computes attention for all positions at once—fully parallelizable on GPU hardware.

Computational Complexity

  • RNN: \(O(n)\) sequential steps, each costing \(O(d^2)\) → total \(O(nd^2)\), not parallelizable across time
  • Transformer self-attention: \(O(n^2 d)\) total operations with \(O(1)\) sequential depth → fully parallelizable (rough counts in the sketch below)
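
As a rough back-of-the-envelope comparison, ignoring constant factors and the other sublayers, with an illustrative sequence length and model dimension:

```python
# Per-layer multiply-add counts, up to constants:
#   recurrent layer ~ n * d^2, spread over n sequential steps
#   self-attention  ~ n^2 * d, with O(1) sequential depth
n, d = 1024, 512                      # illustrative sequence length and model dimension
print("RNN layer:      ", n * d * d, "ops over", n, "sequential steps")
print("Attention layer:", n * n * d, "ops, fully parallel across positions")
```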

The Transformer Architecture

The original Transformer follows an encoder-decoder structure. Modern variants often use only encoder (BERT) or decoder (GPT).

[Architecture diagram: the encoder stack (Input Embedding + Positional Encoding, then N layers of Self-Attention and Feed-Forward) feeds the decoder stack (Output Embedding + Positional Encoding, then N layers of Masked Self-Attention, Cross-Attention, and Feed-Forward).]

Component 1: Token Embeddings

Each token is mapped to a dense vector via a learned embedding matrix \(E \in \mathbb{R}^{|V| \times d}\):

\[x_i = E[\text{token}_i] \in \mathbb{R}^d\]

Component 2: Positional Encoding

Attention has no inherent notion of position, so positional information is injected by adding sinusoidal encodings to the token embeddings:

\[PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)\]
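
A short NumPy sketch of the encoding above; the resulting matrix is simply added to the token embeddings (dimensions are arbitrary, and an even \(d\) is assumed):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(d // 2)[None, :]                # (1, d/2)
    angle = pos / (10000 ** (2 * i / d))          # (max_len, d/2)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angle)                   # even dimensions
    pe[:, 1::2] = np.cos(angle)                   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d=16)
print(pe.shape)  # (50, 16); each row is added to the embedding at that position
```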

Component 3: Layer Normalization

\[\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sigma} + \beta\]

where \(\mu\) and \(\sigma\) are the mean and standard deviation of \(x\) over the feature dimension, and \(\gamma, \beta \in \mathbb{R}^d\) are learned scale and shift parameters.

Component 4: Residual Connections

\[\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))\]

Component 5: Feed-Forward Network

\[\text{FFN}(x) = W_2 \cdot \text{ReLU}(W_1 x + b_1) + b_2\]

This network is applied independently at each position; its inner dimension is typically several times larger than \(d\) (2048 vs. \(d = 512\) in the original paper).
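
Components 3–5 are small enough to sketch together. The snippet below is a minimal NumPy version using row-vector conventions (so the FFN reads \(xW_1\) rather than \(W_1 x\)); the sizes are illustrative.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row (token) over its feature dimension, then rescale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: ReLU projection up, linear projection down."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def residual_sublayer(x, sublayer, gamma, beta):
    """Post-norm residual connection: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x), gamma, beta)

# Toy sizes: 3 tokens, model dim 8, FFN inner dim 32.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)
gamma, beta = np.ones(8), np.zeros(8)
y = residual_sublayer(x, lambda h: ffn(h, W1, b1, W2, b2), gamma, beta)
print(y.shape)  # (3, 8)
```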

The Self-Attention Mechanism

Self-attention is the mathematical heart of the Transformer.

Definition: Scaled Dot-Product Attention

Given queries \(Q \in \mathbb{R}^{n \times d_k}\), keys \(K \in \mathbb{R}^{n \times d_k}\), and values \(V \in \mathbb{R}^{n \times d_v}\):

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]

Step 1: Q, K, V Projections

For self-attention, all three derive from the same input \(X\):

\[Q = XW^Q, \quad K = XW^K, \quad V = XW^V\]

with learned projection matrices \(W^Q, W^K \in \mathbb{R}^{d \times d_k}\) and \(W^V \in \mathbb{R}^{d \times d_v}\).

Step 2: Computing Attention Scores

\[e_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}\]

Proposition: Scaling Factor

If \(q, k\) have i.i.d. components with mean 0 and variance 1, then \(\text{Var}(q \cdot k) = d_k\). Dividing by \(\sqrt{d_k}\) normalizes variance to 1.
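
This is easy to check empirically; a quick Monte Carlo verification with standard normal components (the sample size is arbitrary):

```python
import numpy as np

# Empirically verify Var(q · k) ≈ d_k for q, k with i.i.d. N(0, 1) components,
# and that dividing the dot product by sqrt(d_k) restores roughly unit variance.
rng = np.random.default_rng(0)
for d_k in (4, 16, 64, 256):
    q = rng.normal(size=(100_000, d_k))
    k = rng.normal(size=(100_000, d_k))
    dots = np.sum(q * k, axis=1)
    print(d_k, dots.var(), (dots / np.sqrt(d_k)).var())
    # prints approximately: d_k, ~d_k, ~1.0
```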

Step 3: Softmax Normalization

\[\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}\]

Step 4: Weighted Aggregation

\[\text{output}_i = \sum_{j=1}^{n} \alpha_{ij} v_j\]
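
Steps 1–4 fit in a few lines of NumPy. The sketch below is a single head with no batching and no masking, and the projection matrices are random placeholders rather than learned weights:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention.
    Step 1: project the input into queries, keys, and values.
    Step 2: scaled dot-product scores e_ij = q_i · k_j / sqrt(d_k).
    Step 3: softmax over each row to get attention weights.
    Step 4: weighted aggregation of the value vectors."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    A = softmax(scores, axis=-1)                # (n, n) attention weights
    return A @ V, A

# Toy run: 3 tokens, model dim 4, head dim 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 2)) for _ in range(3))
out, A = self_attention(X, Wq, Wk, Wv)
print(A.round(2))       # each row sums to 1
print(out.shape)        # (3, 2)
```

Each row of \(A\) sums to 1, so every output vector is a convex combination of the value vectors.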

Multi-Head Attention

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O, \quad \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\]

Each head applies scaled dot-product attention with its own learned projections, so different heads can specialize in different types of relationships between positions.
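
A minimal sketch of multi-head attention, computing the \(h\) heads in a loop over slices of the projected matrices for clarity (practical implementations reshape and batch this into a single tensor operation):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O, where each head
    runs scaled dot-product attention on its own slice of the projections."""
    n, d = X.shape
    d_h = d // h                                   # per-head dimension
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # (n, d) each
    heads = []
    for i in range(h):
        sl = slice(i * d_h, (i + 1) * d_h)
        A = softmax(Q[:, sl] @ K[:, sl].T / np.sqrt(d_h), axis=-1)
        heads.append(A @ V[:, sl])                 # (n, d_h)
    return np.concatenate(heads, axis=-1) @ Wo     # (n, d)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                        # 5 tokens, model dim 8
Wq, Wk, Wv, Wo = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, h=2).shape)   # (5, 8)
```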

Interactive Visualization: Attention Patterns

[Interactive widget: selecting a word in the example sentence "The cat sat on the mat because it was soft" displays that word's attention distribution over the rest of the sentence.]

Causal (Masked) Attention

\[\text{MaskedAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top + M}{\sqrt{d_k}}\right)V\]

where \(M_{ij} = 0\) if \(j \leq i\) and \(M_{ij} = -\infty\) otherwise.
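
A small NumPy sketch of causal attention, building the additive mask from the condition \(j \leq i\) (the mask is added after the \(\sqrt{d_k}\) scaling here, which is equivalent since its entries are 0 or \(-\infty\)):

```python
import numpy as np

def causal_mask(n):
    """M[i, j] = 0 for j <= i (current and past positions), -inf for j > i (future)."""
    return np.where(np.arange(n)[None, :] <= np.arange(n)[:, None], 0.0, -np.inf)

def causal_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    scores -= scores.max(axis=-1, keepdims=True)        # stable softmax
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 2)) for _ in range(3))
out = causal_attention(Q, K, V)
# Each position only mixes information from itself and earlier positions;
# e.g. the first output row equals V[0] exactly, since it can attend only to itself.
print(np.allclose(out[0], V[0]))   # True
```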

Worked Example

Let's trace through a complete self-attention computation.

Setup: \(n = 3\) tokens, \(d = 4\), \(d_k = d_v = 2\)

\[X = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 0 & 0 \end{bmatrix}\]

STEP 1

Define Projection Matrices

\[W^Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad W^K = \begin{bmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad W^V = \begin{bmatrix} 1 & 1 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}\]

STEP 2

Compute Q, K, V

\[Q = \begin{bmatrix} 2 & 0 \\ 0 & 2 \\ 1 & 1 \end{bmatrix}, \quad K = \begin{bmatrix} 0 & 2 \\ 2 & 0 \\ 1 & 1 \end{bmatrix}, \quad V = \begin{bmatrix} 2 & 1 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}\]

STEP 3

Compute Attention Scores

\[QK^\top = \begin{bmatrix} 0 & 4 & 2 \\ 4 & 0 & 2 \\ 2 & 2 & 2 \end{bmatrix}, \quad \frac{QK^\top}{\sqrt{2}} \approx \begin{bmatrix} 0 & 2.83 & 1.41 \\ 2.83 & 0 & 1.41 \\ 1.41 & 1.41 & 1.41 \end{bmatrix}\]

STEP 4

Apply Softmax

\[A \approx \begin{bmatrix} 0.05 & 0.77 & 0.19 \\ 0.77 & 0.05 & 0.19 \\ 0.33 & 0.33 & 0.33 \end{bmatrix}\]

Interpretation: Token A attends mostly to B (weight 0.77), B mostly to A (0.77), and C attends equally to all three.

STEP 5

Compute Output

\[\text{Output} = AV \approx \begin{bmatrix} 0.28 & 1.0 \\ 1.72 & 1.0 \\ 1.0 & 1.0 \end{bmatrix}\]

What Did We Compute?

Each output row is a weighted combination of value vectors, where weights depend on query-key similarity. Token A's output is dominated by B's value because A's query aligned best with B's key.
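
The whole example can be reproduced in a few lines of NumPy; the printed values agree with the matrices above up to rounding:

```python
import numpy as np

X  = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]], dtype=float)
Wq = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)
Wk = np.array([[0, 1], [1, 0], [0, 1], [1, 0]], dtype=float)
Wv = np.array([[1, 1], [0, 0], [1, 0], [0, 1]], dtype=float)

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(2)                       # scaled attention scores
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
print(A.round(2))        # ≈ [[0.05 0.77 0.19] [0.77 0.05 0.19] [0.33 0.33 0.33]]
print((A @ V).round(2))  # ≈ [[0.28 1.  ] [1.72 1.  ] [1.   1.  ]]
```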

Modern Applications

Transformers have become dominant across virtually all domains of ML.

💬

Large Language Models

GPT-4, Claude, Gemini—decoder-only Transformers trained on trillions of tokens with emergent reasoning.

🖼️

Computer Vision

Vision Transformers treat images as sequences of patches, matching or exceeding CNNs.

🎨

Image Generation

DALL-E, Stable Diffusion use Transformer-based architectures for text-to-image.

🧬

Protein Structure

AlphaFold 2 uses attention to predict 3D structures from sequences.

🎵

Audio & Speech

Whisper, AudioLM, and music models process sequential audio data.

💻

Code Generation

Codex, Copilot generate, explain, and debug code from natural language.

Scaling Laws

Transformer performance follows predictable power laws in parameter count \(N\), dataset size \(D\), and training compute \(C\) (Kaplan et al., 2020):

\[L(N) \propto N^{-0.076}, \quad L(D) \propto D^{-0.095}, \quad L(C) \propto C^{-0.050}\]
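
For intuition about what such exponents mean in practice, a one-line calculation using the parameter-count exponent (only the ratio of losses is meaningful here, not absolute values):

```python
# Under L(N) ∝ N^(-0.076), scaling model size N by 10x multiplies the loss by:
factor = 10 ** (-0.076)
print(f"{factor:.3f}  (~16% lower loss for 10x more parameters)")
```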

Open Questions

Summary

Key Takeaways

  1. Transformers replace recurrence with attention—enabling parallel processing
  2. Self-attention is a learnable weighted average—softmax-normalized dot products
  3. The architecture is modular—attention, FFN, residual, normalization
  4. Scaling is predictable—power laws in parameters, data, compute
  5. Applications are universal—language, vision, biology, and beyond

Further Reading

Discussion Questions

  1. What structures might allow sub-quadratic attention while preserving expressivity?
  2. What are the implications of positional encodings for position-independent algorithms?
  3. Is there a connection between multi-head attention and tensor decomposition?
  4. How does soft attention compare to hard attention in optimization?