Lesson 2.3: Attention: the one idea that changed everything | GeekHub Learn

Attention is the engine. Without it, no ChatGPT. With it, an English sentence "knows" which earlier words modify which later ones. Understanding this one mechanism unlocks the deepest mental model in this course.

You walk into a noisy cafe and someone says your name. Your brain instantly tunes out everything else and focuses on the voice that said it. Attention in a Transformer does the same: for every token, it figures out which other tokens "deserve focus" right now.

For each token, the model computes three vectors: a query (what am I looking for?), a key (what do I offer?), and a value (here is my payload). Each token's query is compared, via a dot product, with the key of every token in the sequence, itself included. A softmax turns those scores into weights that sum to 1, and the token's new representation is the weighted blend of all the values, so the strongest matches contribute the most.
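
The query, key, and value steps can be sketched in a few lines of plain Python. This is a minimal illustration, not how a production model is written: the toy vectors are hand-picked assumptions (a real model learns projections that produce them), and the function names are made up for this lesson. The division by the square root of the key size follows the scaled dot-product form in "Attention Is All You Need".

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this token's query with every token's key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Blend the values using the attention weights.
        blended = [sum(w * v[i] for w, v in zip(weights, values))
                   for i in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Hand-picked toy vectors for three tokens (illustration only, not learned).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = attention(queries, keys, values)
for row in weights:
    print([round(w, 3) for w in row])  # one row of attention weights per token
```

Each printed row shows how one token spreads its attention across all three tokens. The rows always sum to 1, which is what makes the output a blend rather than a sum.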

This is self-attention. The "self" means the comparisons are within a single sequence.

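For a sequence of n tokens, attention computes an n x n matrix of scores, one for each query-key pair. This is why long context windows are expensive: the matrix grows with the square of the context length, so doubling the context quadruples the number of scores. Sparse attention and sliding windows shrink the number of scores that get computed, and FlashAttention computes exact attention without storing the full matrix in memory, but the cost is still real.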
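
To make the growth concrete, here is a small counting sketch. It only counts score entries and ignores everything else a model does. Both function names are hypothetical, and the sliding-window variant assumes one simple definition: each token attends to itself and to the tokens just before it, up to `window` tokens in total.

```python
def attention_score_count(n_tokens):
    """Full attention: every token's query meets every token's key."""
    return n_tokens * n_tokens

def sliding_window_score_count(n_tokens, window):
    """Assumed sliding window: each token sees itself plus earlier tokens,
    up to `window` tokens in total."""
    return sum(min(position + 1, window) for position in range(n_tokens))

for n in (1_000, 2_000, 4_000):
    full = attention_score_count(n)
    windowed = sliding_window_score_count(n, window=256)
    print(f"{n} tokens: {full} full scores, {windowed} windowed scores")
```

Doubling the token count quadruples the full count, while the windowed count grows roughly in step with the token count once the sequence is longer than the window.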
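
"Multi-head" attention means doing this several times in parallel with different learned projections, then combining the results. Different heads can end up focusing on different relationships (for example syntax, coreference, or topic), although the split is learned during training and is rarely that tidy.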

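The multi-head idea can be sketched as follows. This is a simplified illustration under stated assumptions: real models give each head its own learned projection and apply a final output projection after combining, whereas here each head simply gets a slice of the vector and the head outputs are concatenated. The function names are hypothetical.

```python
import math

def attend(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

def split_heads(vectors, n_heads):
    """Slice every vector into n_heads equal parts (stand-in for learned projections)."""
    size = len(vectors[0]) // n_heads
    return [[v[h * size:(h + 1) * size] for v in vectors] for h in range(n_heads)]

def multi_head_attention(queries, keys, values, n_heads):
    """Run attention once per head, then concatenate the head outputs per token."""
    per_head = [
        attend(q, k, v)
        for q, k, v in zip(split_heads(queries, n_heads),
                           split_heads(keys, n_heads),
                           split_heads(values, n_heads))
    ]
    combined = []
    for token_index in range(len(queries)):
        row = []
        for head_output in per_head:
            row.extend(head_output[token_index])
        combined.append(row)
    return combined

# Toy 4-dimensional vectors for three tokens, reused as queries, keys, and values.
tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
print(multi_head_attention(tokens, tokens, tokens, n_heads=2))
```

Because each head only sees its own slice, the two heads can produce different attention patterns over the same three tokens.
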
Visualize it

A heatmap diagram showing a sentence on both axes ("The cat sat on the mat because it was tired"). The cell where "it" meets "cat" lights up. This is the famous "attention head learns coreference" visual.

Try it now

In ChatGPT, write: "The trophy did not fit in the suitcase because it was too big. What does 'it' refer to?" Then: "The trophy did not fit in the suitcase because it was too small. What does 'it' refer to?" You should get different answers, even though only one adjective changed. Attention is what lets "it" pick up that change and link to a different noun.

Hands-on lab

No code needed for this lab. Write down 3 ambiguous-pronoun sentences. Predict the model's interpretation. Test in ChatGPT. Note where the model nailed it and where it slipped.

Try it now

Why is attention n^2 in the sequence length? Explain in one sentence.

Common mistakes

Thinking attention is "the model paying attention to the user". It is internal, token-to-token.
Believing attention solves long context perfectly. It does not. Models can still lose track of details in 200-page documents.

Debugging tip

If a long-input prompt produces vague answers, the model may be losing attention on the middle of the input (the "lost in the middle" effect). Restructure: put the most important content at the start or end.
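
One way to apply this restructuring is a small helper that reorders prompt sections before you send them. This is a rough sketch, not an established API: the function name, the importance scores, and the rule itself (most important section first, second most important last, the rest in the middle) are all assumptions made for illustration.

```python
def arrange_for_long_context(sections):
    """Reorder (importance, text) pairs so the two most important texts sit
    at the start and end of the prompt. Hypothetical helper for illustration."""
    ranked = sorted(sections, key=lambda item: item[0], reverse=True)
    texts = [text for _, text in ranked]
    if len(texts) < 3:
        return texts
    # Most important first, runner-up last, everything else in the middle.
    return [texts[0]] + texts[2:] + [texts[1]]

sections = [
    (1, "Background notes"),
    (3, "The question to answer"),
    (2, "The key contract clause"),
    (0, "Boilerplate appendix"),
]
prompt = "\n\n".join(arrange_for_long_context(sections))
print(prompt)
```

Whether this ordering helps depends on the model and the task, so compare answers before and after reordering.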

Challenge

Read the abstract of "Attention Is All You Need" (Vaswani et al., 2017) and write a 200-word explainer for a non-engineer.

Where this shows up

Long-document understanding (with the lost-in-the-middle caveat)
Coreference resolution in chat
Code understanding: "this variable refers to..."

From the field

The n^2 cost shape is a big part of why long-context models cost so much more to run. It is also why prompt compression is a real career: shrinking an input to a tenth of its length cuts the number of attention scores roughly a hundredfold.

Recap

Attention lets each token look at the other tokens in the sequence and decide what matters. It is arguably the single most important architectural innovation of the modern AI era.

Quick recall

What are queries, keys, and values? What does each do?

Quiz time

1. Attention complexity in sequence length is