Attention: the one idea that changed everything
Attention is the engine. Without it, no ChatGPT. With it, a model can work out which earlier words in a sentence bear on which later ones. Understanding this one mechanism unlocks the deepest mental model in this course.
You walk into a noisy cafe and someone says your name. Your brain instantly tunes out everything else and focuses on the voice that said it. Attention in a Transformer does the same: for every token, it figures out which other tokens "deserve focus" right now.
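For each token, the model computes three vectors: a query (what am I looking for?), a key (what do I offer?), and a value (here is my payload). Each token's query is compared, via a dot product, to the key of every token in the sequence, including its own (GPT-style models restrict this to the current and earlier tokens). The scores are scaled and passed through a softmax, which turns them into weights that sum to 1. The strongest matches get the most "attention", and the current token's new representation is the weighted blend of the value vectors.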
This is self-attention. The "self" means the comparisons are within a single sequence.
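The query/key/value steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not how a production model is written: the toy vectors are made up for the example (in a real model they come from learned projections of the token embeddings), there is no masking, and the names softmax, dot, and self_attention are hypothetical helpers.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention over one sequence (no masking)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this token's query with every token's key, itself included.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Blend the value vectors using the attention weights.
        blended = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy vectors for three tokens (assumed for illustration only).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

outputs, weights = self_attention(queries, keys, values)
for token_index, row in enumerate(weights):
    print(token_index, [round(w, 3) for w in row])
```

Each printed row shows how one token spreads its attention across all three tokens. The first token's query matches the first and third keys equally well, so those two receive equal weight and the second key receives less.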
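For a sequence of n tokens, attention computes an n x n matrix of scores, one for each query-key pair. This is why long context windows are expensive: the matrix grows with the square of context length, so doubling the context roughly quadruples the work. Modern tricks soften this in different ways: sparse attention and sliding windows score only a subset of the pairs, while FlashAttention still computes every pair exactly but avoids storing the full matrix in memory. The cost is still real.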
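To make the quadratic growth concrete, here is a small back-of-the-envelope sketch. It assumes one attention head in one layer and 2 bytes per score (16-bit floats); both are illustrative assumptions, and the function names are hypothetical. The sliding-window figure is a simple upper bound, assuming each token scores at most a fixed number of neighbors.

```python
def score_matrix_entries(n_tokens):
    """Number of query-key scores in full attention: n x n."""
    return n_tokens * n_tokens

def score_matrix_bytes(n_tokens, bytes_per_score=2):
    """Memory to store the full score matrix (assumes 16-bit scores)."""
    return score_matrix_entries(n_tokens) * bytes_per_score

def sliding_window_entries(n_tokens, window):
    """Upper bound on scores when each token sees at most `window` tokens."""
    return n_tokens * min(n_tokens, window)

for n in (1_000, 2_000, 4_000, 8_000):
    full = score_matrix_entries(n)
    windowed = sliding_window_entries(n, window=512)
    megabytes = score_matrix_bytes(n) / 1_000_000
    print(f"n={n}: full={full} scores ({megabytes:.0f} MB), windowed<={windowed}")
```

Each doubling of n multiplies the full count by four, while the windowed bound only doubles. Real models repeat this for every head and every layer, so the totals are much larger than a single matrix suggests.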
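"Multi-head" attention means doing this several times in parallel, each time with its own learned projections for queries, keys, and values, then concatenating the heads' outputs and mixing them with a final learned projection. Different heads can learn to focus on different relationships (for example syntax, coreference, or topic), though trained heads do not always split up this neatly.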
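The multi-head idea can be sketched the same way. This is a minimal illustration under stated assumptions: random matrices stand in for learned projection weights, the sizes (4-dimensional tokens, 2 heads) are arbitrary, there is no masking, and every function name here is a hypothetical helper rather than a library API.

```python
import math
import random

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention for a single head."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return out

def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix (random, for illustration)."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(tokens, heads, w_out):
    """`heads` holds one (w_q, w_k, w_v) projection triple per head."""
    per_head = []
    for w_q, w_k, w_v in heads:
        # Each head projects the same tokens with its own matrices.
        q = [matvec(w_q, t) for t in tokens]
        k = [matvec(w_k, t) for t in tokens]
        v = [matvec(w_v, t) for t in tokens]
        per_head.append(attend(q, k, v))
    combined = []
    for i in range(len(tokens)):
        # Concatenate every head's output for token i, then mix them.
        concat = [x for head_out in per_head for x in head_out[i]]
        combined.append(matvec(w_out, concat))
    return combined

rng = random.Random(0)
d_model, n_heads = 4, 2
d_head = d_model // n_heads
tokens = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(3)]
heads = [
    tuple(random_matrix(d_head, d_model, rng) for _ in range(3))
    for _ in range(n_heads)
]
w_out = random_matrix(d_model, n_heads * d_head, rng)

result = multi_head_attention(tokens, heads, w_out)
print(len(result), len(result[0]))
```

Splitting the model dimension across heads (here 4 into 2 heads of 2) keeps the total work close to that of one full-width head, while letting each head score the tokens in its own way.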
Quick recall
What are queries, keys, and values? What does each do?
Quiz time
1. Attention complexity in sequence length is