A Mathematical Introduction

The Transformer
Architecture

Exploring the mathematical foundations behind the architecture that revolutionized artificial intelligence.

Nir Naim
Tel Aviv University
Queueing Theory Seminar

What Are Transformers?

In the landscape of deep learning, the Transformer represents a fundamental architectural paradigm shift. Introduced in 2017 by Vaswani et al. in "Attention Is All You Need," Transformers replaced sequential processing with a purely attention-based mechanism that processes entire sequences in parallel.

Definition: Transformer

A Transformer is a neural network architecture that maps sequences to sequences using self-attention as its core computational primitive. Unlike RNNs, it has no inherent notion of sequential order—position information must be explicitly encoded.

For mathematicians, the key insight is this: whereas an RNN processes a sequence \((x_1, x_2, \ldots, x_n)\) by maintaining a hidden state \(h_t = f(h_{t-1}, x_t)\) that evolves through time, a Transformer computes representations for all positions simultaneously via learned weighted combinations.

The Core Innovation

The mathematical elegance of Transformers lies in replacing recurrence with attention—a mechanism that computes, for each element of a sequence, a weighted average over all elements (including itself), where the weights themselves are computed from the data.

Key Mathematical Insight

Self-attention can be viewed as a soft, differentiable version of database lookup. Given a query, it retrieves values by computing similarity to keys. This is fundamentally a weighted sum where weights come from a softmax over dot products—entirely composed of differentiable operations.
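
To make the lookup analogy concrete, here is a minimal NumPy sketch; the keys, values, and query are toy numbers chosen purely for illustration.

```python
import numpy as np

def soft_lookup(query, keys, values):
    """Soft database lookup: return a weighted mix of the value rows,
    weighted by softmax similarity between the query and each key."""
    scores = keys @ query                      # dot-product similarity to each key
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                    # weighted average of the rows of `values`

# A hard lookup with this query would return values[1]; the soft version
# returns a value close to it, blended with small contributions from the rest.
keys   = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, -0.7]])
values = np.array([[10.0], [20.0], [30.0]])
query  = np.array([0.0, 3.0])                  # most similar to the second key
print(soft_lookup(query, keys, values))        # ≈ [19.6]
```

Because every operation here (dot product, exponential, normalization, weighted sum) is differentiable, gradients flow through the lookup during training.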

Prerequisites

How Transformers Changed AI

To appreciate the significance of Transformers, we must understand what came before.

The Sequential Bottleneck

Before Transformers, sequence modeling was dominated by Recurrent Neural Networks (RNNs). These process sequences step-by-step:

\[h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t + b)\]

This sequential dependency creates two fundamental problems:

  1. No parallelization: computing \(h_t\) requires \(h_{t-1}\), so the \(n\) time steps must run one after another (see the sketch after this list)
  2. Long-range dependencies: information must flow through many steps, leading to vanishing or exploding gradients
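
The sketch below makes the bottleneck concrete: a direct NumPy implementation of the recurrence above, with randomly initialized weights standing in for learned parameters.

```python
import numpy as np

def rnn_forward(X, W_hh, W_xh, b):
    """Vanilla RNN: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b).
    The loop is inherently sequential: step t cannot begin until step t-1 is done."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in X:                               # one iteration per time step
        h = np.tanh(W_hh @ h + W_xh @ x_t + b)
        states.append(h)
    return np.stack(states)

# Toy sizes: sequence length 5, input dim 3, hidden dim 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
H = rnn_forward(X, rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4))
print(H.shape)  # (5, 4)
```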

2014

Sequence-to-Sequence with Attention

Bahdanau et al. introduce attention for machine translation, allowing decoders to "look back" at encoder states.

2017

Attention Is All You Need

Vaswani et al. eliminate recurrence entirely, reaching state-of-the-art translation quality at a small fraction of the training cost of previous best models.

2018

BERT & GPT

Transformers pretrained on massive text corpora revolutionize NLP.

2020

Vision Transformers

Dosovitskiy et al. apply Transformers to images, challenging CNN dominance.

2022+

Large Language Models

GPT-4, Claude, Gemini—hundreds of billions of parameters, emergent capabilities.

The Parallelization Advantage

The core attention operation can be expressed as matrix multiplications:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]

This single expression computes attention for all positions at once—fully parallelizable on GPU hardware.

Computational Complexity

  • RNN: \(O(n)\) sequential steps, each costing \(O(d^2)\) → total \(O(nd^2)\), not parallelizable across time
  • Transformer self-attention: \(O(n^2 d)\) total operations with \(O(1)\) sequential depth → fully parallelizable (rough counts in the sketch below)
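
As a rough back-of-the-envelope comparison, ignoring constant factors and the other sublayers, with an illustrative sequence length and model dimension:

```python
# Per-layer multiply-add counts, up to constants:
#   recurrent layer ~ n * d^2, spread over n sequential steps
#   self-attention  ~ n^2 * d, with O(1) sequential depth
n, d = 1024, 512                      # illustrative sequence length and model dimension
print("RNN layer:      ", n * d * d, "ops over", n, "sequential steps")
print("Attention layer:", n * n * d, "ops, fully parallel across positions")
```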

The Transformer Architecture

The original Transformer follows an encoder-decoder structure. Modern variants often use only encoder (BERT) or decoder (GPT).

[Architecture diagram: the encoder stack (Input Embedding + Positional Encoding, then N layers of Self-Attention and Feed-Forward) feeds the decoder stack (Output Embedding + Positional Encoding, then N layers of Masked Self-Attention, Cross-Attention, and Feed-Forward).]

Component 1: Token Embeddings

Each token is mapped to a dense vector via a learned embedding matrix \(E \in \mathbb{R}^{|V| \times d}\):

\[x_i = E[\text{token}_i] \in \mathbb{R}^d\]

Component 2: Positional Encoding

Attention has no inherent notion of position, so positional information is injected by adding sinusoidal encodings to the token embeddings:

\[PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)\]
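
A short NumPy sketch of the encoding above; the resulting matrix is simply added to the token embeddings (dimensions are arbitrary, and an even \(d\) is assumed):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(d // 2)[None, :]                # (1, d/2)
    angle = pos / (10000 ** (2 * i / d))          # (max_len, d/2)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angle)                   # even dimensions
    pe[:, 1::2] = np.cos(angle)                   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d=16)
print(pe.shape)  # (50, 16); each row is added to the embedding at that position
```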

Component 3: Layer Normalization

\[\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sigma} + \beta\]

where \(\mu\) and \(\sigma\) are the mean and standard deviation of \(x\) over the feature dimension, and \(\gamma, \beta \in \mathbb{R}^d\) are learned scale and shift parameters.

Component 4: Residual Connections

\[\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))\]

Component 5: Feed-Forward Network

\[\text{FFN}(x) = W_2 \cdot \text{ReLU}(W_1 x + b_1) + b_2\]

This network is applied independently at each position; its inner dimension is typically several times larger than \(d\) (2048 vs. \(d = 512\) in the original paper).
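
Components 3–5 are small enough to sketch together. The snippet below is a minimal NumPy version using row-vector conventions (so the FFN reads \(xW_1\) rather than \(W_1 x\)); the sizes are illustrative.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row (token) over its feature dimension, then rescale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: ReLU projection up, linear projection down."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def residual_sublayer(x, sublayer, gamma, beta):
    """Post-norm residual connection: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x), gamma, beta)

# Toy sizes: 3 tokens, model dim 8, FFN inner dim 32.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)
gamma, beta = np.ones(8), np.zeros(8)
y = residual_sublayer(x, lambda h: ffn(h, W1, b1, W2, b2), gamma, beta)
print(y.shape)  # (3, 8)
```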

The Self-Attention Mechanism

Self-attention is the mathematical heart of the Transformer.

Definition: Scaled Dot-Product Attention

Given queries \(Q \in \mathbb{R}^{n \times d_k}\), keys \(K \in \mathbb{R}^{n \times d_k}\), and values \(V \in \mathbb{R}^{n \times d_v}\):

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]

Step 1: Q, K, V Projections

For self-attention, all three derive from the same input \(X\):

\[Q = XW^Q, \quad K = XW^K, \quad V = XW^V\]

with learned projection matrices \(W^Q, W^K \in \mathbb{R}^{d \times d_k}\) and \(W^V \in \mathbb{R}^{d \times d_v}\).

Step 2: Computing Attention Scores

\[e_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}\]

Proposition: Scaling Factor

If \(q, k\) have i.i.d. components with mean 0 and variance 1, then \(\text{Var}(q \cdot k) = d_k\). Dividing by \(\sqrt{d_k}\) normalizes variance to 1.
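
This is easy to check empirically; a quick Monte Carlo verification with standard normal components (the sample size is arbitrary):

```python
import numpy as np

# Empirically verify Var(q · k) ≈ d_k for q, k with i.i.d. N(0, 1) components,
# and that dividing the dot product by sqrt(d_k) restores roughly unit variance.
rng = np.random.default_rng(0)
for d_k in (4, 16, 64, 256):
    q = rng.normal(size=(100_000, d_k))
    k = rng.normal(size=(100_000, d_k))
    dots = np.sum(q * k, axis=1)
    print(d_k, dots.var(), (dots / np.sqrt(d_k)).var())
    # prints approximately: d_k, ~d_k, ~1.0
```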

Step 3: Softmax Normalization

\[\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}\]

Step 4: Weighted Aggregation

\[\text{output}_i = \sum_{j=1}^{n} \alpha_{ij} v_j\]
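
Steps 1–4 fit in a few lines of NumPy. The sketch below is a single head with no batching and no masking, and the projection matrices are random placeholders rather than learned weights:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention.
    Step 1: project the input into queries, keys, and values.
    Step 2: scaled dot-product scores e_ij = q_i · k_j / sqrt(d_k).
    Step 3: softmax over each row to get attention weights.
    Step 4: weighted aggregation of the value vectors."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    A = softmax(scores, axis=-1)                # (n, n) attention weights
    return A @ V, A

# Toy run: 3 tokens, model dim 4, head dim 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 2)) for _ in range(3))
out, A = self_attention(X, Wq, Wk, Wv)
print(A.round(2))       # each row sums to 1
print(out.shape)        # (3, 2)
```

Each row of \(A\) sums to 1, so every output vector is a convex combination of the value vectors.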

Multi-Head Attention

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O, \quad \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\]

Each head applies scaled dot-product attention with its own learned projections, so different heads can specialize in different types of relationships between positions.
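
A minimal sketch of multi-head attention, computing the \(h\) heads in a loop over slices of the projected matrices for clarity (practical implementations reshape and batch this into a single tensor operation):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O, where each head
    runs scaled dot-product attention on its own slice of the projections."""
    n, d = X.shape
    d_h = d // h                                   # per-head dimension
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # (n, d) each
    heads = []
    for i in range(h):
        sl = slice(i * d_h, (i + 1) * d_h)
        A = softmax(Q[:, sl] @ K[:, sl].T / np.sqrt(d_h), axis=-1)
        heads.append(A @ V[:, sl])                 # (n, d_h)
    return np.concatenate(heads, axis=-1) @ Wo     # (n, d)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                        # 5 tokens, model dim 8
Wq, Wk, Wv, Wo = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, h=2).shape)   # (5, 8)
```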

Interactive Visualization: Attention Patterns

[Interactive widget: selecting a word in the example sentence "The cat sat on the mat because it was soft" displays that word's attention distribution over the rest of the sentence.]

Causal (Masked) Attention

\[\text{MaskedAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top + M}{\sqrt{d_k}}\right)V\]

where \(M_{ij} = 0\) if \(j \leq i\) and \(M_{ij} = -\infty\) otherwise.
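
A small NumPy sketch of causal attention, building the additive mask from the condition \(j \leq i\) (the mask is added after the \(\sqrt{d_k}\) scaling here, which is equivalent since its entries are 0 or \(-\infty\)):

```python
import numpy as np

def causal_mask(n):
    """M[i, j] = 0 for j <= i (current and past positions), -inf for j > i (future)."""
    return np.where(np.arange(n)[None, :] <= np.arange(n)[:, None], 0.0, -np.inf)

def causal_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    scores -= scores.max(axis=-1, keepdims=True)        # stable softmax
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 2)) for _ in range(3))
out = causal_attention(Q, K, V)
# Each position only mixes information from itself and earlier positions;
# e.g. the first output row equals V[0] exactly, since it can attend only to itself.
print(np.allclose(out[0], V[0]))   # True
```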

Worked Example

Let's trace through a complete self-attention computation.

Setup: \(n = 3\) tokens, \(d = 4\), \(d_k = d_v = 2\)

\[X = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 0 & 0 \end{bmatrix}\]

STEP 1

Define Projection Matrices

\[W^Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad W^K = \begin{bmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad W^V = \begin{bmatrix} 1 & 1 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}\]

STEP 2

Compute Q, K, V

\[Q = \begin{bmatrix} 2 & 0 \\ 0 & 2 \\ 1 & 1 \end{bmatrix}, \quad K = \begin{bmatrix} 0 & 2 \\ 2 & 0 \\ 1 & 1 \end{bmatrix}, \quad V = \begin{bmatrix} 2 & 1 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}\]

STEP 3

Compute Attention Scores

\[QK^\top = \begin{bmatrix} 0 & 4 & 2 \\ 4 & 0 & 2 \\ 2 & 2 & 2 \end{bmatrix}, \quad \frac{QK^\top}{\sqrt{2}} \approx \begin{bmatrix} 0 & 2.83 & 1.41 \\ 2.83 & 0 & 1.41 \\ 1.41 & 1.41 & 1.41 \end{bmatrix}\]

STEP 4

Apply Softmax

\[A \approx \begin{bmatrix} 0.05 & 0.77 & 0.19 \\ 0.77 & 0.05 & 0.19 \\ 0.33 & 0.33 & 0.33 \end{bmatrix}\]

Interpretation: Token A attends mostly to B (weight 0.77), B mostly to A (0.77), and C attends equally to all three.

STEP 5

Compute Output

\[\text{Output} = AV \approx \begin{bmatrix} 0.28 & 1.0 \\ 1.72 & 1.0 \\ 1.0 & 1.0 \end{bmatrix}\]

What Did We Compute?

Each output row is a weighted combination of value vectors, where weights depend on query-key similarity. Token A's output is dominated by B's value because A's query aligned best with B's key.
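
The whole example can be reproduced in a few lines of NumPy; the printed values agree with the matrices above up to rounding:

```python
import numpy as np

X  = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]], dtype=float)
Wq = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)
Wk = np.array([[0, 1], [1, 0], [0, 1], [1, 0]], dtype=float)
Wv = np.array([[1, 1], [0, 0], [1, 0], [0, 1]], dtype=float)

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(2)                       # scaled attention scores
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
print(A.round(2))        # ≈ [[0.05 0.77 0.19] [0.77 0.05 0.19] [0.33 0.33 0.33]]
print((A @ V).round(2))  # ≈ [[0.28 1.  ] [1.72 1.  ] [1.   1.  ]]
```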

Modern Applications

Transformers have become dominant across virtually all domains of ML.

💬

Large Language Models

GPT-4, Claude, Gemini—decoder-only Transformers trained on trillions of tokens with emergent reasoning.

🖼️

Computer Vision

Vision Transformers treat images as sequences of patches, matching or exceeding CNNs.

🎨

Image Generation

DALL-E, Stable Diffusion use Transformer-based architectures for text-to-image.

🧬

Protein Structure

AlphaFold 2 uses attention to predict 3D structures from sequences.

🎵

Audio & Speech

Whisper, AudioLM, and music models process sequential audio data.

💻

Code Generation

Codex, Copilot generate, explain, and debug code from natural language.

Scaling Laws

Transformer performance follows predictable power laws in parameter count \(N\), dataset size \(D\), and training compute \(C\) (Kaplan et al., 2020):

\[L(N) \propto N^{-0.076}, \quad L(D) \propto D^{-0.095}, \quad L(C) \propto C^{-0.050}\]
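
For intuition about what such exponents mean in practice, a one-line calculation using the parameter-count exponent (only the ratio of losses is meaningful here, not absolute values):

```python
# Under L(N) ∝ N^(-0.076), scaling model size N by 10x multiplies the loss by:
factor = 10 ** (-0.076)
print(f"{factor:.3f}  (~16% lower loss for 10x more parameters)")
```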

Open Questions

Summary

Key Takeaways

  1. Transformers replace recurrence with attention—enabling parallel processing
  2. Self-attention is a learnable weighted average—softmax-normalized dot products
  3. The architecture is modular—attention, FFN, residual, normalization
  4. Scaling is predictable—power laws in parameters, data, compute
  5. Applications are universal—language, vision, biology, and beyond

Further Reading

Discussion Questions

  1. What structures might allow sub-quadratic attention while preserving expressivity?
  2. What are the implications of positional encodings for position-independent algorithms?
  3. Is there a connection between multi-head attention and tensor decomposition?
  4. How does soft attention compare to hard attention in optimization?