Module 2: How ChatGPT and Transformers Work (Beginner Friendly)

Module Goal

Lift the hood. By the end of this module you can explain on a napkin how ChatGPT goes from your prompt to its answer, what a Transformer actually does, and why "attention" matters. No heavy math.

Estimated Duration

4 to 6 hours.

Skills Learned

Explaining tokens, embeddings, attention, and next-token prediction in plain English
Tracing the path of a single prompt through an LLM
Reasoning about why LLMs hallucinate and forget
Reading a model spec without panic

Real-world Importance

Engineers who understand the machine ship better products. They write prompts that respect how the model thinks, debug failures by tracing them to the architecture, and pick models with clear-eyed reasoning instead of brand loyalty.

Lessons in this module

From prompt to answer: the 6-step journey
Tokens and embeddings: how text becomes numbers
Attention: the one idea that changed everything
Next-token prediction: ChatGPT is really autocomplete
Pretraining vs fine-tuning vs RLHF, in plain English

Lesson 2.1: From prompt to answer: the 6-step journey

Hook / Why This Matters

When you press Enter on ChatGPT, six precise things happen. Every later module hooks into one of those steps. This is your skeleton.

Beginner Analogy

Imagine a translator at the UN. They hear your words, mentally turn each into a concept, listen to the rest of the sentence to decide what each concept emphasizes, then say one word at a time in the target language, looping back to keep what they have already said in mind. Transformers do something similar, but with numbers.

Concept Explanation

The six steps:

Tokenization: your text is chopped into small chunks called tokens.
Embedding: each token is turned into a vector of numbers.
Attention: each token "looks at" the others to decide which matter for its meaning in this context.
Stacked layers: this attention process repeats across many layers, refining understanding.
Next-token prediction: the model outputs probabilities over every possible next token.
Sampling and loop: one token is picked (by probability), appended to the input, and the whole loop runs again until a stop signal.

That is ChatGPT. Everything else (RAG, fine-tuning, agents) is wrapping or augmenting these six steps.
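
To make the loop concrete, here is a minimal sketch of the six steps in plain Python. Everything in it is an assumption for illustration: the six-word vocabulary, the whitespace tokenizer, and especially toy_model, which is a stub standing in for the embedding and attention layers of a real model. Only the shape of the loop (predict, sample, append, repeat until a stop token or a token limit) mirrors the real thing.

```python
import math
import random

# Hypothetical toy vocabulary; real tokenizers use sub-word units, not whole words.
VOCAB = ["<stop>", "the", "cat", "sat", "on", "mat"]
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(text):
    """Step 1: map text to token ids (whitespace split is a simplification)."""
    return [TOKEN_ID[word] for word in text.lower().split()]

def toy_model(token_ids):
    """Stand-in for steps 2-4: return one score (logit) per vocabulary entry.

    A real model would embed the tokens and run stacked attention layers.
    This stub simply favours the token that follows the last one in VOCAB.
    """
    logits = [0.0] * len(VOCAB)
    logits[(token_ids[-1] + 1) % len(VOCAB)] = 5.0
    return logits

def softmax(logits):
    """Step 5: turn scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, max_tokens=10, seed=0):
    """Step 6: sample one token, append it, and loop until a stop signal."""
    rng = random.Random(seed)
    ids = tokenize(prompt)
    for _ in range(max_tokens):
        probs = softmax(toy_model(ids))
        next_id = rng.choices(range(len(VOCAB)), weights=probs)[0]
        if VOCAB[next_id] == "<stop>":
            break
        ids.append(next_id)
    return " ".join(VOCAB[i] for i in ids)

print(generate("the cat"))
```

Try changing max_tokens to 1 and notice that generation ends early even though no stop token was sampled. Those are the two exits from the loop.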

Technical Breakdown

The model is a stack of structurally identical layers (often 32 to 96 of them in modern LLMs), each with its own learned weights. Each layer has two sub-layers: self-attention and a feed-forward network. The output of each layer is the input to the next. After the final layer, the model projects the result to one score per vocabulary token, and a softmax turns those scores into probabilities.
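
The data flow described here can be sketched in a few lines. This is only a shape-level illustration under loud assumptions: toy_layer is a placeholder that nudges the vector instead of doing real attention and feed-forward work, and the hidden size (3), vocabulary size (4), and all numbers are made up.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a [rows x cols] matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def toy_layer(hidden):
    """Placeholder for one layer (self-attention + feed-forward).

    A real layer mixes information across tokens; this one only adjusts
    the vector slightly so the stacking pattern is visible.
    """
    return [h + 0.1 * math.tanh(h) for h in hidden]

def forward(hidden, num_layers, unembed):
    for _ in range(num_layers):          # output of each layer feeds the next
        hidden = toy_layer(hidden)
    logits = matvec(unembed, hidden)     # project to one score per vocab token
    return softmax(logits)               # scores -> probabilities

hidden = [0.5, -1.0, 2.0]                                # hidden_dim = 3
unembed = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]   # vocab_size = 4
probs = forward(hidden, num_layers=4, unembed=unembed)
print(probs)
```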

Visual Learning Suggestion

A horizontal pipeline diagram with 6 boxes labeled with the 6 steps. Arrows between. Loop arrow from step 6 back to step 1 showing autoregressive generation.

Interactive Element

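Ask ChatGPT: "What is your next token going to be after 'The capital of France is'? Show the top 5 candidates if you can." Notice it strongly prefers "Paris". Keep in mind that a chat model cannot inspect its own probabilities, so any candidate list it gives you is an estimate. Under the hood it is still computing a distribution over every token.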

Hands-on Lab

Open https://platform.openai.com/tokenizer. Paste any 50-word paragraph. Observe how it gets split. Note how spaces, punctuation, and uncommon words get split differently. Tokens are not words.

Mini Exercise

Count: how many tokens is the sentence "Hello, world! This is GeekHub."? Use the tokenizer tool.

Common Mistakes

Thinking the model "looks up" answers. It does not. It computes them.
Thinking the model writes whole sentences at once. It writes one token at a time.
Confusing tokens with words. They are not the same.

Debugging Tips

If your app behaves oddly at high token counts, check whether you are bumping against the context window. The loop in step 6 cannot remember anything outside the current context.

Knowledge Check Questions

Recite the six steps without looking.
Why is generation called "autoregressive"?
What ends the generation loop?

Quiz Questions

Generation stops when:
a) The model finishes a sentence
b) A stop token is sampled or max_tokens is hit
c) The user closes the tab
d) The temperature reaches zero
Answer: b

Challenge Task

Draw the 6-step diagram from memory on paper. Photograph it. Post it on GeekHub with #ai-beginners.

Real-world Use Cases

Streaming chat UIs: the autoregressive loop is what lets you watch the response appear token by token.
Token-level cost analysis: providers bill per input and output token, so estimating token counts lets you predict cost.
Function/tool calling: the model emits a structured token sequence the runtime intercepts.

Industry Insight

Knowing this loop is what lets you write production code that calls an LLM with stream=True. It is also why you can cancel a generation mid-flight, saving money on long outputs.
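
Here is a small simulation of why cancelling mid-flight works. It does not call any real API: fake_token_stream is a hypothetical stand-in for a streaming response, and the character budget is an arbitrary stopping rule chosen for the demo. The point is only that the consumer pulls tokens one at a time and can stop pulling whenever it wants.

```python
def fake_token_stream(tokens):
    """Hypothetical stand-in for a streaming response: yields one token at a time."""
    for token in tokens:
        yield token

def consume(stream, max_chars):
    """Collect streamed tokens and stop early once max_chars is reached."""
    received = []
    total = 0
    for token in stream:
        received.append(token)
        total += len(token)
        if total >= max_chars:
            # With a real streaming client, this is where you would
            # cancel the request so no further tokens are generated.
            stream.close()
            break
    return "".join(received)

stream = fake_token_stream(["Par", "is", " is", " the", " capital", "."])
text = consume(stream, max_chars=5)
print(text)  # Paris
```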

Interview Questions

Walk me through what happens after a user submits a prompt to ChatGPT.
Why is LLM generation called autoregressive?
How does stopping a generation early save cost?

Summary

ChatGPT is a six-step pipeline: tokenize, embed, attend, layer, predict next token, sample and loop. Everything else builds on this.

Lesson 2.2: Tokens and embeddings: how text becomes numbers

Hook / Why This Matters

Neural networks do not understand text. They only do math on numbers. The conversion from text to numbers is where many engineers get confused and where every API price tag lives.

Beginner Analogy

Think of a multilingual phone book. Every word gets a unique phone number. Words with similar meanings get phone numbers nearby. That neighborhood structure is the "embedding".

Concept Explanation

Tokens: not words. Tokens are sub-word units chosen by the model's tokenizer (often Byte-Pair Encoding). The English word "tokenization" might be 3 tokens. The Hindi word "नमस्ते" might be 5 tokens. Code symbols often pack into 1 token.

Embeddings: each token id is mapped to a fixed-length vector of floats (often 1,024 or 4,096 numbers). This vector is the model's internal representation. Similar meanings produce nearby vectors.
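
"Nearby vectors" is usually measured with cosine similarity. The sketch below uses made-up 3-number vectors (real embeddings have hundreds or thousands of numbers, and their values are learned, not hand-written) to show the idea: related words score close to 1, unrelated words score lower.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy embeddings, for illustration only.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(f"cat vs dog: {cat_dog:.2f}")
print(f"cat vs car: {cat_car:.2f}")
```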

Technical Breakdown

The tokenizer is a deterministic algorithm that maps strings to a vocabulary of token ids (often 50K to 200K). The model has an embedding matrix of shape [vocab_size, hidden_dim]. Looking up a token id retrieves its starting vector. Subsequent layers refine that vector contextually.
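
The lookup itself is nothing more than indexing rows of a table. In this sketch the vocabulary (3 tokens) and the embedding matrix (shape [3, 4]) are hypothetical and tiny; a real model learns a matrix with tens of thousands of rows during training.

```python
# Hypothetical toy vocabulary: token string -> token id.
vocab = {"hello": 0, " world": 1, "!": 2}

# Embedding matrix of shape [vocab_size, hidden_dim] = [3, 4]; values are made up.
embedding_matrix = [
    [0.1, 0.3, -0.2, 0.7],   # row 0: "hello"
    [0.5, -0.1, 0.4, 0.0],   # row 1: " world"
    [-0.3, 0.2, 0.9, 0.1],   # row 2: "!"
]

def embed(tokens):
    """Map token strings to ids, then look up each id's starting vector."""
    ids = [vocab[token] for token in tokens]
    vectors = [embedding_matrix[token_id] for token_id in ids]
    return ids, vectors

ids, vectors = embed(["hello", " world", "!"])
print(ids)         # [0, 1, 2]
print(vectors[0])  # [0.1, 0.3, -0.2, 0.7]
```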

The "context window" is measured in tokens, not characters or words. A 128K context model can hold roughly 96,000 English words, using the common rule of thumb of about 0.75 words per token.
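
The word estimate is simple arithmetic. The 0.75 words-per-token ratio below is an assumed rule of thumb for English prose, not a property of any specific tokenizer; other languages and code will differ.

```python
WORDS_PER_TOKEN = 0.75  # assumed rough average for English text

def approx_words(context_tokens, words_per_token=WORDS_PER_TOKEN):
    """Estimate how many English words fit in a context window."""
    return int(context_tokens * words_per_token)

print(approx_words(128_000))  # 96000
```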

Visual Learning Suggestion

Insert a 3-panel visual: (1) "Hello world" highlighted with token boundaries, (2) each token shown with its id, (3) each id shown landing in a vector neighborhood, with similar words nearby.

Interactive Element

On the OpenAI tokenizer page, type the same sentence in English, Hindi, and code. Compare token counts. This is exactly how you will lose money if you ignore it.

Hands-on Lab

In a notebook (or even pen and paper), estimate the cost of a chat that uses 500 input tokens and 800 output tokens, on a model that charges $1 per million input tokens and $3 per million output tokens. Answer at end of lesson.

Mini Exercise

If a Hindi sentence is 3x more tokens than its English translation, what does that mean for cost and latency?

Common Mistakes

Estimating context window in "words" instead of tokens
Assuming all languages cost the same per character
Ignoring that JSON output uses more tokens than plain text

Debugging Tips

If responses get cut off, your max_tokens may be too low. If they are slow, your input token count may be too high. Both are token-driven.
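
Since both problems come down to token counts, it helps to measure them. One possible sketch: use the third-party tiktoken library when it is installed, and fall back to a rough heuristic otherwise. The "about 4 characters per token" fallback and the choice of the cl100k_base encoding are assumptions; which encoding matches your model depends on the provider, so check their documentation.

```python
def estimate_tokens(text):
    """Rough fallback: assume about 4 characters per token (English-only heuristic)."""
    return max(1, len(text) // 4)

def count_tokens(text, encoding_name="cl100k_base"):
    """Count tokens with tiktoken if available, else fall back to the heuristic."""
    try:
        import tiktoken  # third-party: pip install tiktoken
    except ImportError:
        return estimate_tokens(text)
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

# Example usage (commented out because the first tiktoken call may download
# its vocabulary file):
# print(count_tokens("Hello, world! This is GeekHub."))
```

The heuristic is only good for ballpark budgeting. For billing or hard context limits, use the tokenizer that matches your model.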

Knowledge Check Questions

What is a token? Why is it not a word?
What is an embedding? What is its shape?
Why does Hindi cost more per character than English?

Quiz Questions

The context window is measured in:
a) Characters
b) Words
c) Tokens
d) Megabytes
Answer: c

Challenge Task

Use the OpenAI tokenizer to find a paragraph that produces wildly different token counts in two languages. Document it as a one-page note.

Real-world Use Cases

Token counting is required for billing dashboards.
Prompt compression libraries shrink token counts to fit context windows.
Internationalization: an Indian-language chatbot can be 3x more expensive without redesign.

Industry Insight

In 2026 production, engineers regularly cut costs 30 to 50% by managing token usage: pre-summarizing inputs, switching to providers whose tokenizer handles their target language better, or moving to small models for short tasks.

Interview Questions

What is the difference between a token and a word?
Why might the same sentence in Hindi cost 3x more than in English?
How do you count tokens in Python?

Summary

Text becomes tokens, tokens become embeddings, embeddings are the language the model thinks in. Cost, context limits, and latency are all token-driven.

Hands-on Lab answer: 500 input tokens at $1/M = $0.0005. 800 output at $3/M = $0.0024. Total per chat = $0.0029.
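
The same arithmetic as a reusable function, in case you want to try other token counts or prices. The prices are the hypothetical ones from the lab, not any provider's real rates.

```python
def chat_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost in dollars, given prices quoted per million tokens."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost

cost = chat_cost(500, 800, input_price_per_m=1.0, output_price_per_m=3.0)
print(f"${cost:.4f}")  # $0.0029
```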

Lesson 2.3: Attention: the one idea that changed everything

Hook / Why This Matters

Attention is the engine. Without it, no ChatGPT. With it, an English sentence "knows" which earlier words modify which later ones. Understanding this one mechanism unlocks the deepest mental model in this course.

Beginner Analogy

You walk into a noisy cafe and someone says your name. Your brain instantly tunes out everything else and focuses on the voice that said it. Attention in a Transformer does the same: for every token, it figures out which other tokens "deserve focus" right now.

Concept Explanation

For each token, the model computes three vectors: a query (what am I looking for?), a key (what do I offer?), and a value (here is my payload). Each token's query is compared to the keys of the tokens it is allowed to see (in GPT-style models, that means itself and the tokens before it). The matches with the strongest scores get the most "attention" and their values get blended into the current token's new representation.

This is self-attention. The "self" means the comparisons are within a single sequence.
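
Here is a minimal sketch of that query/key/value blend in plain Python, with made-up 2-number vectors for three tokens. Two simplifications to be aware of: in a real model the queries, keys, and values come from learned projections of the token embeddings (here they are just handed in), and the causal option reflects how GPT-style decoders restrict each token to itself and earlier tokens.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values, causal=True):
    """Scaled dot-product attention over a single sequence.

    With causal=True, token i only attends to tokens 0..i.
    Returns the new vector for each token and the attention weights used.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        visible = range(i + 1) if causal else range(len(keys))
        # Compare this token's query with each visible token's key.
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
            for j in visible
        ]
        weights = softmax(scores)
        # Blend the visible tokens' values, weighted by attention.
        blended = [
            sum(w * values[j][d] for w, j in zip(weights, visible))
            for d in range(len(values[0]))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up vectors for three tokens.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = self_attention(q, k, v)
for token_index, row in enumerate(weights):
    print(token_index, [round(w, 2) for w in row])
```

Each printed row is one token's attention weights. The rows get longer because later tokens can see more of the sequence, and every row sums to 1.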

Technical Breakdown

For a sequence of n tokens, attention computes roughly an n x n score matrix. This is why long context windows are expensive: the matrix grows with the square of context length. Modern techniques ease this in different ways: sparse attention and sliding windows compute fewer scores, while FlashAttention computes the same scores with far less memory traffic. The cost is still real.
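
A quick way to feel the quadratic growth is to count the score-matrix entries for a few sequence lengths. This counts entries for one attention head in one layer and ignores optimizations, so treat it as a shape illustration, not a cost model.

```python
def attention_scores(n_tokens):
    """Number of entries in a full n x n attention score matrix."""
    return n_tokens * n_tokens

for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} tokens -> {attention_scores(n):>12,} scores")

# Doubling the sequence length quadruples the number of scores.
ratio = attention_scores(2_000) / attention_scores(1_000)
print(ratio)  # 4.0
```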

"Multi-head" attention means doing this several times in parallel with different learned projections, then combining. Each head learns to focus on a different relationship (syntax, coreference, topic).

Visual Learning Suggestion

A heatmap diagram showing a sentence on both axes ("The cat sat on the mat because it was tired"). The cell where "it" meets "cat" lights up. This is the famous "attention head learns coreference" visual.

Interactive Element

In ChatGPT, write: "The trophy did not fit in the suitcase because it was too big. What does 'it' refer to?" Then: "The trophy did not fit in the suitcase because it was too small. What does 'it' refer to?" Different answers. Same words. Attention is what flips them.

Hands-on Lab

No code yet. Write down 3 ambiguous-pronoun sentences. Predict the model's interpretation. Test in ChatGPT. Note where attention nailed it and where it slipped.

Mini Exercise

Why is attention n^2 in the sequence length? Explain in one sentence.

Common Mistakes

Thinking attention is "the model paying attention to the user". It is internal, token-to-token.
Believing attention solves long-context perfectly. It does not. Models still get lost in 200-page documents.

Debugging Tips

If a long-input prompt produces vague answers, the model may be losing attention on the middle of the input (the "lost in the middle" effect). Restructure: put the most important content at the start or end.
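
One way to apply that restructuring is a small prompt builder that puts the question and key facts first and repeats the question at the end. The function name, section labels, and layout are all illustrative assumptions; adapt them to your own prompt format.

```python
def build_prompt(question, key_facts, background):
    """Put the most important content at the start and repeat the question at the end."""
    parts = [
        f"Question: {question}",
        "Key facts:",
        *[f"- {fact}" for fact in key_facts],
        "Background:",
        background,                                   # long, lower-priority material
        f"Reminder, the question is: {question}",
    ]
    return "\n".join(parts)

prompt = build_prompt(
    question="What is the refund deadline?",
    key_facts=["Refunds are allowed within 30 days."],
    background="(long policy document goes here)",
)
print(prompt)
```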

Knowledge Check Questions

What are queries, keys, and values? What does each do?
Why does attention scale quadratically?
What is multi-head attention and why is it useful?

Quiz Questions

Attention complexity in sequence length is:
a) O(1)
b) O(n)
c) O(n log n)
d) O(n^2)
Answer: d

Challenge Task

Read the abstract of "Attention Is All You Need" (Vaswani et al., 2017) and write a 200-word explainer for a non-engineer.

Real-world Use Cases

Long-document understanding (with the lost-in-the-middle caveat)
Coreference resolution in chat
Code understanding: "this variable refers to..."

Industry Insight

The n^2 cost shape is a big part of why long-context requests cost so much more. It is also why prompt compression is a real career: halving an input cuts the attention score matrix to roughly a quarter of its size.

Interview Questions

Explain self-attention to a non-technical PM in 60 seconds.
Why is multi-head attention better than single-head?
How would you mitigate the "lost in the middle" effect?

Summary

Attention lets each token look at every other token and decide what matters. It is the single most important architectural innovation of the modern AI era.

Lesson 2.4: Next-token prediction: ChatGPT is really autocomplete

Hook / Why This Matters

The whole magical thing is a glorified autocomplete. Once that lands, everything from hallucinations to creativity makes sense.

Beginner Analogy

You type "I am going to the gym to..." and your phone suggests "work out", "lift weights", "exercise". That is next-token prediction. ChatGPT does the same, but with dozens of Transformer layers and a vocabulary of up to roughly 200,000 tokens.

Concept Explanation

At every step, the model outputs a probability for each token in its vocabulary. The token with the highest probability is the "most likely next token". The system picks one (often not strictly the top, see "temperature") and appends it. Then it does it again. And again. Until a stop token.

A seemingly intelligent 800-token paragraph is really 800 sequential autocompletes.

Technical Breakdown

The model emits a vector of "logits" of length vocab_size. A softmax turns it into a probability distribution. Sampling strategies include:

Greedy (temperature 0): always pick the top token. Deterministic, sometimes repetitive.
Temperature sampling: divide logits by T and sample. T > 1 is wilder, T < 1 is more conservative.
Top-k: only sample from the top k tokens.
Top-p (nucleus): only sample from the smallest set of most likely tokens whose cumulative probability reaches p.
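
The four strategies above can be sketched in plain Python. The logits are made up, and production systems implement these on GPU tensors, often combining several filters in one pass, so read this as a picture of the logic only.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Divide logits by temperature, then normalize. Requires temperature > 0."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Temperature-0 behaviour: always pick the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero the rest, renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def sample(probs, rng):
    """Draw one token id according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs)[0]

logits = [4.0, 2.0, 1.0, 0.5]   # made-up scores for a 4-token vocabulary
rng = random.Random(0)

print("greedy:", greedy(logits))
print("T=0.5 :", [round(x, 3) for x in softmax(logits, temperature=0.5)])
print("T=2.0 :", [round(x, 3) for x in softmax(logits, temperature=2.0)])
token = sample(top_p_filter(softmax(logits, temperature=0.7), p=0.9), rng)
print("sampled:", token)
```

Compare the T=0.5 and T=2.0 rows: lower temperature piles probability onto the top token, higher temperature spreads it out.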

Visual Learning Suggestion

A bar chart of next-token probabilities after the prompt "The capital of France is". One huge bar for "Paris", small bars for "the", "Lyon", "Marseille". Adjust the chart for "The capital of France is famous for its..." and watch the distribution flatten.

Interactive Element

In the OpenAI Playground or Google AI Studio, run the same prompt twice with temperature 0 (usually identical or near-identical answers) and twice with temperature 1 (different answers). Feel the difference.

Hands-on Lab

Write a one-line prompt about your hometown. Run it five times at temperature 1. Observe the variation. Now set temperature to 0 and run again. Note that the outputs are identical (modulo provider non-determinism).

Mini Exercise

If you set temperature to 0 and still see different responses across runs, what is one possible reason?

Common Mistakes

Believing temperature 0 guarantees identical outputs across runs and providers. It does not.
Cranking temperature to "make the model creative" instead of using prompt design.
Forgetting that the model is choosing tokens, not facts.

Debugging Tips

If you need structured outputs that are as reproducible as possible, use temperature 0 plus structured output mode (JSON schema). If you need creative variation, temperature 0.7 to 1.0 is a common starting range.
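
Even with those settings, it is worth validating what comes back before your code trusts it. A minimal sketch using only the standard library; the function name and the required-keys check are illustrative assumptions, and a real project might use a full schema validator instead.

```python
import json

def parse_model_json(raw_text, required_keys):
    """Return the parsed object, or None if the text is not a JSON object with the expected keys."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if not all(key in data for key in required_keys):
        return None
    return data

print(parse_model_json('{"city": "Paris"}', ["city"]))         # {'city': 'Paris'}
print(parse_model_json('Sure! Here you go: {"city"', ["city"]))  # None
```

Returning None gives the caller a clear signal to retry or fall back instead of crashing on a malformed response.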

Knowledge Check Questions

What does temperature control?
What is the difference between top-k and top-p sampling?
Why does the model sometimes "hallucinate" plausible-sounding facts?

Quiz Questions

Hallucinations occur because:
a) The model is broken
b) The model is computing token probabilities, not retrieving facts
c) The temperature is too low
d) The prompt is too short
Answer: b

Challenge Task

Build a one-shot "creative slogan generator" prompt. Optimize for variety with sampling settings rather than longer prompts.

Real-world Use Cases

Streaming chat: each token is sent to the UI as it is sampled
Cost control: stop sequences and max_tokens limits save money on long outputs
Reliable JSON: temperature 0 plus JSON mode

Industry Insight

A very common 2026 production gotcha: a developer assumes the model "knows" something, ships it, then hallucinations crash the demo. The model never "knows". It samples. Build accordingly.

Interview Questions

What is next-token prediction?
Compare greedy, top-k, top-p, and temperature sampling.
Why are LLMs prone to hallucination?

Summary

ChatGPT is autocomplete with dozens of Transformer layers. Knowing that everything is just sampling from a probability distribution makes hallucinations, creativity, and reliability all click into place.

Lesson 2.5: Pretraining vs fine-tuning vs RLHF, in plain English

Hook / Why This Matters

Every model you will use went through three life stages. Knowing them tells you what it is good at and where it will fail.

Beginner Analogy

Stage 1: read every book in the library. Stage 2: take a specialized course. Stage 3: get coached by mentors on how to behave. That is pretraining, fine-tuning, RLHF.

Concept Explanation

Pretraining: the model is fed trillions of tokens of text and learns to predict the next token. This gives it language fluency, world knowledge, and general capabilities. Costs millions of dollars.
Fine-tuning (Supervised): the model is shown thousands of pairs of (instruction, ideal response). This teaches it to follow instructions in the desired format.
RLHF (Reinforcement Learning from Human Feedback): humans rank multiple model responses. A reward model learns the rankings. The model is updated to produce higher-ranked answers. This teaches helpfulness, harmlessness, and honesty.

Some modern frontier models also use RLAIF (AI feedback in place of human feedback) and Constitutional AI to scale this stage.
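
To see how "a reward model learns the rankings", here is one commonly used formulation as a sketch: a pairwise loss that is small when the response humans preferred gets a higher reward score than the one they rejected. The reward numbers are made up, and this shows only the loss for a single pair, not a training loop or any particular lab's exact recipe.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(chosen - rejected)).

    Low when the preferred response scores higher, high when it scores lower.
    """
    margin = reward_chosen - reward_rejected
    sigmoid = 1.0 / (1.0 + math.exp(-margin))
    return -math.log(sigmoid)

agrees = preference_loss(reward_chosen=2.0, reward_rejected=-1.0)
disagrees = preference_loss(reward_chosen=-1.0, reward_rejected=2.0)
print(f"reward model agrees with humans:    {agrees:.3f}")
print(f"reward model disagrees with humans: {disagrees:.3f}")
```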

Technical Breakdown

You will rarely pretrain a model yourself (cost is prohibitive). You may fine-tune one if you have 100 to 10,000 high-quality examples. You almost never run RLHF yourself for chat. Most production teams stop at fine-tuning, often using LoRA (Low-Rank Adaptation, which trains small adapter matrices instead of all the weights) to keep costs low.
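
Why LoRA keeps costs low comes down to counting parameters. The sketch below compares fully updating one weight matrix with training a low-rank pair of matrices for it. The 4096 x 4096 size and rank 8 are illustrative assumptions, not figures from any specific model.

```python
def full_params(d_out, d_in):
    """Trainable parameters when updating the whole weight matrix."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Trainable parameters for two small matrices: (d_out x rank) and (rank x d_in)."""
    return rank * (d_out + d_in)

full = full_params(4096, 4096)
lora = lora_params(4096, 4096, rank=8)
print(f"full fine-tune: {full:,}")        # 16,777,216
print(f"LoRA (rank 8):  {lora:,}")        # 65,536
print(f"reduction:      {full // lora}x")  # 256x
```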

Visual Learning Suggestion

A three-stage horizontal pipeline labeled "Pretrain (years of internet text) -> Fine-tune (instruction pairs) -> RLHF (human preferences)" with an output box labeled "ChatGPT-style assistant".

Interactive Element

Ask GPT-4 or Claude: "What was the last time you were updated?" Compare to: "Search the web for today's date." The first hits its training cutoff. The second uses tools. Train vs deploy time matters.

Hands-on Lab

Look up the "knowledge cutoff" of three different LLMs (Google a phrase like "GPT-4 knowledge cutoff", "Claude knowledge cutoff", "Gemini knowledge cutoff"). Note them. Note that this is why models do not know "current" news without tool use.

Mini Exercise

Why does RLHF make models more agreeable but sometimes less accurate?

Common Mistakes

Trying to fine-tune to "teach the model facts". Fine-tuning mostly teaches style and format; it is an unreliable way to add durable facts.
Confusing fine-tuning with RAG. RAG retrieves at inference time. Fine-tuning bakes in behavior.
Believing a model knows things "as of today". It knows up to its training cutoff.

Debugging Tips

If your model gives stale answers, you need RAG or tools, not fine-tuning. If your model ignores format instructions, fine-tuning may help.

Knowledge Check Questions

Define pretraining, fine-tuning, and RLHF.
When would you fine-tune vs use RAG?
What is a knowledge cutoff?

Quiz Questions

To make the model answer in a specific JSON format every time, your first move should be:
a) Pretrain from scratch
b) Use a system prompt and JSON mode
c) Run RLHF
d) Buy more GPUs
Answer: b

Challenge Task

Sketch a decision tree: "Should I prompt, fine-tune, or RAG this problem?". Three branches with one example each.

Real-world Use Cases

Pretraining: not your job
Fine-tuning: domain-specific writing styles, structured outputs, low-resource languages
RLHF: behavior tuning at the model provider level

Industry Insight

In 2026, a rough rule of thumb is that prompt + RAG solves about 80% of business problems and fine-tuning solves another 15%. Pretraining is left to the labs. This ratio shapes your career: master prompting and RAG first.

Interview Questions

Walk me through the three training stages of a modern chat LLM.
When would you choose fine-tuning over RAG?
What is LoRA and why does it matter?

Summary

Pretraining builds the brain. Fine-tuning shapes the personality. RLHF gives it manners. As a developer, you mostly work above all three with prompts, RAG, and the occasional fine-tune.

Module 2 Recap

You can now narrate, from memory, the path of a prompt through ChatGPT: tokenize, embed, attend, layer, predict, sample and loop. You know why hallucinations happen, why temperature exists, and how a model goes from pretraining to your screen.

Next Module

Module 3: Tokens, Prompts, Context Windows, and AI Conversations