Module 2: How ChatGPT and Transformers Work (Beginner Friendly)
Module Goal
Lift the hood. By the end of this module you can explain on a napkin how ChatGPT goes from your prompt to its answer, what a Transformer actually does, and why "attention" matters. No heavy math.
Estimated Duration
4 to 6 hours.
Skills Learned
- Explaining tokens, embeddings, attention, and next-token prediction in plain English
- Tracing the path of a single prompt through an LLM
- Reasoning about why LLMs hallucinate and forget
- Reading a model spec without panic
Real-world Importance
Engineers who understand the machine ship better products. They write prompts that respect how the model thinks, debug failures by tracing them to the architecture, and pick models with clear-eyed reasoning instead of brand loyalty.
Lessons in this module
- From prompt to answer: the 6-step journey
- Tokens and embeddings: how text becomes numbers
- Attention: the one idea that changed everything
- Next-token prediction: ChatGPT is really autocomplete
- Pretraining vs fine-tuning vs RLHF, in plain English
Lesson 2.1: From prompt to answer: the 6-step journey
Hook / Why This Matters
When you press Enter on ChatGPT, six precise things happen. Every later module hooks into one of those steps. This is your skeleton.
Beginner Analogy
Imagine a translator at the UN. They hear your words, mentally turn each into a concept, listen to the rest of the sentence to decide what each concept emphasizes, then say one word at a time in the target language, looping back to keep what they have already said in mind. Transformers do roughly that, but with numbers.
Concept Explanation
The six steps:
- Tokenization: your text is chopped into small chunks called tokens.
- Embedding: each token is turned into a vector of numbers.
- Attention: each token "looks at" the others to decide which matter for its meaning in this context.
- Stacked layers: this attention process repeats across many layers, refining understanding.
- Next-token prediction: the model outputs probabilities over every possible next token.
- Sampling and loop: one token is picked (by probability), appended to the input, and the whole loop runs again until a stop signal.
That is ChatGPT. Everything else (RAG, fine-tuning, agents) is wrapping or augmenting these six steps.
Technical Breakdown
The model is a stack of layers with the same structure but separately learned weights (often 32 to 96 of them in modern LLMs). Each layer has two sub-layers: self-attention and a feed-forward network. The output of each layer is the input to the next. After the last layer, a final projection maps each vector back to vocabulary size, and a softmax turns those scores into probabilities.
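The six steps can be sketched as a toy loop. This is a minimal illustration, not how ChatGPT is implemented: the vocabulary, the `toy_logits` scoring rule, and every function name are made up for this example, and steps 2 to 5 are collapsed into one stand-in function.

```python
import math
import random

# Hypothetical toy vocabulary; a real model has tens of thousands of tokens.
VOCAB = ["<stop>", "the", "cat", "sat", "on", "mat"]

def tokenize(text):
    """Step 1: map each whitespace-separated word to its id (toy rule)."""
    return [VOCAB.index(word) for word in text.split()]

def toy_logits(token_ids):
    """Stand-in for steps 2-5 (embed, attend, stack layers, project to vocab).

    A real Transformer computes these scores from learned weights. This toy
    simply favours the token that follows the last one in VOCAB order,
    wrapping around to <stop>.
    """
    preferred = (token_ids[-1] + 1) % len(VOCAB)
    return [4.0 if i == preferred else 0.0 for i in range(len(VOCAB))]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt_ids, max_tokens=10, seed=0):
    """Step 6: sample one token, append it, and repeat until a stop signal."""
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(max_tokens):          # stop signal 1: max_tokens reached
        probs = softmax(toy_logits(ids))
        next_id = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        if next_id == 0:                 # stop signal 2: <stop> was sampled
            break
        ids.append(next_id)
    return ids

print([VOCAB[i] for i in generate(tokenize("the cat"))])
```

Each pass through the `for` loop produces exactly one token, which is why the output of a real model can be streamed piece by piece.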
Visual Learning Suggestion
A horizontal pipeline diagram with 6 boxes labeled with the 6 steps. Arrows between. Loop arrow from step 6 back to step 1 showing autoregressive generation.
Interactive Element
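Ask ChatGPT: "What is your next token going to be after 'The capital of France is'? Show the top 5 candidates if you can." Notice it strongly prefers "Paris". Under the hood the model computes a full probability distribution, but the chat interface does not expose it, so treat any candidates and percentages it lists as an illustration rather than a readout of its real probabilities.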
Hands-on Lab
Open https://platform.openai.com/tokenizer. Paste any 50-word paragraph. Observe how it gets split. Note how spaces, punctuation, and uncommon words get split differently. Tokens are not words.
Mini Exercise
Count: how many tokens is the sentence "Hello, world! This is GeekHub."? Use the tokenizer tool.
Common Mistakes
- Thinking the model "looks up" answers. It does not. It computes them.
- Thinking the model writes whole sentences at once. It writes one token at a time.
- Confusing tokens with words. They are not the same.
Debugging Tips
If your app behaves oddly at high token counts, check whether you are bumping against the context window. The loop in step 6 cannot remember anything outside the current context.
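One way to catch this before sending a request is a simple budget check. The sketch below assumes the prompt and the generated output share one context window, which is how most chat APIs behave but is worth confirming in your provider's documentation. The function name and the numbers are illustrative.

```python
def context_overflow(prompt_tokens: int, max_output_tokens: int, context_window: int) -> int:
    """Return how many tokens the request exceeds the window by (0 if it fits)."""
    return max(0, prompt_tokens + max_output_tokens - context_window)

# Hypothetical 128,000-token window.
print(context_overflow(120_000, 4_000, 128_000))  # 0: the request fits
print(context_overflow(126_000, 4_000, 128_000))  # 2000: trim the prompt by this much
```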
Knowledge Check Questions
- Recite the six steps without looking.
- Why is generation called "autoregressive"?
- What ends the generation loop?
Quiz Questions
- Generation stops when: a) The model finishes a sentence b) A stop token is sampled or max_tokens is hit c) The user closes the tab d) The temperature reaches zero Answer: b
Challenge Task
Draw the 6-step diagram from memory on paper. Photograph it. Post it on GeekHub with #ai-beginners.
Real-world Use Cases
- Streaming chat UIs: the autoregressive loop is what lets you watch the response appear word by word.
- Token-level cost analysis: knowing tokens are emitted one at a time lets you predict pricing.
- Function/tool calling: the model emits a structured token sequence the runtime intercepts.
Industry Insight
Knowing this loop is what lets you write production code that calls an LLM with stream=True. It is also why you can cancel a generation mid-flight, saving money on long outputs.
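The cancel-mid-flight idea can be simulated without any API key. In this sketch, `fake_token_stream` is a hypothetical stand-in for a streaming response, and breaking out of the loop plays the role that closing the connection would play against a real provider.

```python
from typing import Iterator, List

def fake_token_stream(tokens: List[str]) -> Iterator[str]:
    """Hypothetical stand-in for a streaming response: yields one token at a time."""
    for token in tokens:
        yield token

def consume_until(stream: Iterator[str], stop_word: str) -> List[str]:
    """Collect tokens as they arrive and stop early once stop_word appears."""
    received = []
    for token in stream:
        received.append(token)
        if token == stop_word:
            break  # abandon the rest of the stream
    return received

tokens = ["The", " answer", " is", " 42", ".", " Here", " is", " a", " long", " explanation"]
stream = fake_token_stream(tokens)
received = consume_until(stream, stop_word=".")
print("".join(received))  # prints: The answer is 42.
```

Because the generator is lazy, the tokens after the stop word are never produced, which mirrors why an early cancel avoids paying for output you did not need.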
Interview Questions
- Walk me through what happens after a user submits a prompt to ChatGPT.
- Why is LLM generation called autoregressive?
- How does early stopping save cost?
Summary
ChatGPT is a six-step pipeline: tokenize, embed, attend, layer, predict next token, sample and loop. Everything else builds on this.
Lesson 2.2: Tokens and embeddings: how text becomes numbers
Hook / Why This Matters
Neural networks do not understand text. They only do math on numbers. The conversion from text to numbers is where many engineers get confused and where every API price tag lives.
Beginner Analogy
Think of a multilingual phone book. Every word gets a unique phone number. Words with similar meanings get phone numbers nearby. That neighborhood structure is the "embedding".
Concept Explanation
Tokens: not words. Tokens are sub-word units chosen by the model's tokenizer (often Byte-Pair Encoding). The English word "tokenization" might be 3 tokens. The Hindi word "नमस्ते" might be 5 tokens. Code symbols often pack into 1 token.
Embeddings: each token id is mapped to a fixed-length vector of floats (often 1,024 or 4,096 numbers). This vector is the model's internal representation. Similar meanings produce nearby vectors.
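"Nearby vectors" is usually measured with cosine similarity. The sketch below uses made-up 4-dimensional vectors chosen by hand for illustration. Real embeddings are learned during training and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return 1.0 for vectors pointing the same way, about 0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical hand-written vectors, not taken from any real model.
embeddings = {
    "cat":     [0.9, 0.8, 0.1, 0.0],
    "kitten":  [0.8, 0.9, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9, 0.8],
}

print(cosine_similarity(embeddings["cat"], embeddings["kitten"]))   # close to 1
print(cosine_similarity(embeddings["cat"], embeddings["invoice"]))  # much lower
```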
Technical Breakdown
The tokenizer is a deterministic algorithm that maps strings to a vocabulary of token ids (often 50K to 200K). The model has an embedding matrix of shape [vocab_size, hidden_dim]. Looking up a token id retrieves its starting vector. Subsequent layers refine that vector contextually.
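The string-to-id-to-vector path can be sketched in a few lines. Note the assumptions: the six-entry vocabulary is invented, the tokenizer uses a greedy longest-match rule as a simplified stand-in for real Byte-Pair Encoding, and the embedding matrix is filled with random numbers instead of learned values.

```python
import random

# Hypothetical toy vocabulary; index in the list is the token id.
VOCAB = ["<unk>", "token", "ization", "hello", " ", "world"]
TOKEN_TO_ID = {piece: i for i, piece in enumerate(VOCAB)}
HIDDEN_DIM = 4  # real models use values such as 1,024 or 4,096

def tokenize(text):
    """Greedy longest-match tokenizer (a simplification, not real BPE)."""
    ids = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):   # try the longest piece first
            piece = text[pos:end]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos = end
                break
        else:                                   # no piece matched this character
            ids.append(TOKEN_TO_ID["<unk>"])
            pos += 1
    return ids

# Embedding matrix of shape [vocab_size, hidden_dim], random instead of learned.
_rng = random.Random(0)
EMBEDDING_MATRIX = [[_rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_DIM)] for _ in VOCAB]

def embed(token_ids):
    """Look up the starting vector for each token id (one row per token)."""
    return [EMBEDDING_MATRIX[i] for i in token_ids]

ids = tokenize("hello tokenization")
print(ids)              # [3, 4, 1, 2]
print(len(embed(ids)))  # 4 vectors, one per token
```

The same input always produces the same ids, which is what "deterministic" means here. Two words became four tokens, another reminder that tokens are not words.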
The "context window" is measured in tokens, not characters or words. A 128K context model can hold roughly 96,000 English words.
Visual Learning Suggestion
Insert a 3-panel visual: (1) "Hello world" highlighted with token boundaries, (2) each token shown with its id, (3) each id shown landing in a vector neighborhood, with similar words nearby.
Interactive Element
On the OpenAI tokenizer page, type the same sentence in English, Hindi, and code. Compare token counts. This is exactly how you will lose money if you ignore it.
Hands-on Lab
In a notebook (or even pen and paper), estimate the cost of a chat that uses 500 input tokens and 800 output tokens, on a model that charges $1 per million input tokens and $3 per million output tokens. Answer at end of lesson.
Mini Exercise
If a Hindi sentence is 3x more tokens than its English translation, what does that mean for cost and latency?
Common Mistakes
- Estimating context window in "words" instead of tokens
- Assuming all languages cost the same per character
- Ignoring that JSON output uses more tokens than plain text
Debugging Tips
If responses get cut off, check whether your max_tokens is too low. If they are slow, your input token count may be too high. Both are token-driven.
Knowledge Check Questions
- What is a token? Why is it not a word?
- What is an embedding? What is its shape?
- Why does Hindi cost more per character than English?
Quiz Questions
- The context window is measured in: a) Characters b) Words c) Tokens d) Megabytes Answer: c
Challenge Task
Use the OpenAI tokenizer to find a paragraph that produces wildly different token counts in two languages. Document it as a one-page note.
Real-world Use Cases
- Token counting is required for billing dashboards.
- Prompt compression libraries shrink token counts to fit context windows.
- Internationalization: an Indian-language chatbot can be 3x more expensive without redesign.
Industry Insight
In 2026 production, engineers regularly cut costs 30 to 50% by improving how they spend tokens: pre-summarizing inputs, switching to providers whose tokenizer handles their target language better, or moving to small models for short tasks.
Interview Questions
- What is the difference between a token and a word?
- Why might the same sentence in Hindi cost 3x more than in English?
- How do you count tokens in Python?
Summary
Text becomes tokens, tokens become embeddings, embeddings are the language the model thinks in. Cost, context limits, and latency are all token-driven.
Hands-on Lab answer: 500 input tokens at $1/M = $0.0005. 800 output at $3/M = $0.0024. Total per chat = $0.0029.
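The same arithmetic as a reusable function, in case you want to try other prices or token counts. The function and parameter names are illustrative, and the prices are the ones from the lab, not any provider's real rates.

```python
def chat_cost(input_tokens, output_tokens, input_price_per_million, output_price_per_million):
    """Return the dollar cost of one chat given per-million-token prices."""
    input_cost = input_tokens * input_price_per_million / 1_000_000
    output_cost = output_tokens * output_price_per_million / 1_000_000
    return input_cost + output_cost

cost = chat_cost(500, 800, input_price_per_million=1.0, output_price_per_million=3.0)
print(f"${cost:.4f} per chat")              # $0.0029 per chat
print(f"${cost * 100_000:.2f} per 100,000 chats")  # $290.00 per 100,000 chats
```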
Lesson 2.3: Attention: the one idea that changed everything
Hook / Why This Matters
Attention is the engine. Without it, no ChatGPT. With it, an English sentence "knows" which earlier words modify which later ones. Understanding this one mechanism unlocks the deepest mental model in this course.
Beginner Analogy
You walk into a noisy cafe and someone says your name. Your brain instantly tunes out everything else and focuses on the voice that said it. Attention in a Transformer does the same: for every token, it figures out which other tokens "deserve focus" right now.
Concept Explanation
For each token, the model computes three vectors: a query (what am I looking for?), a key (what do I offer?), and a value (here is my payload). Each token's query is compared with the keys of the tokens in the sequence, itself included. The strongest matches get the most "attention", and their values are blended into the current token's new representation. In GPT-style models a token can only attend to itself and earlier tokens, never to ones that come later.
This is self-attention. The "self" means the comparisons are within a single sequence.
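Here is a minimal sketch of that query-key-value comparison (scaled dot-product attention, as described in "Attention Is All You Need"). To stay short it takes the queries, keys, and values as given, so it omits the learned projections that produce them and the masking GPT-style models use to hide later tokens. The numbers are made up.

```python
import math

def softmax(scores):
    """Turn raw match scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Return (new_vectors, attention_weights) for one attention head."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Compare this token's query with every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Blend the values, weighted by how well each key matched.
        blended = [sum(wj * v[dim] for wj, v in zip(w, values))
                   for dim in range(len(values[0]))]
        outputs.append(blended)
    return outputs, weights

# Three toy tokens with 2-dimensional vectors (hypothetical values).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys    = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values  = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = self_attention(queries, keys, values)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row is one token's attention over all three tokens. The rows form the n x n grid that attention heatmaps visualize.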
Technical Breakdown
For a sequence of n tokens, attention is roughly an n x n score matrix. This is why long context windows are expensive: the matrix grows with the square of context length. Modern tricks help in different ways: sparse attention and sliding windows compute fewer scores, while FlashAttention computes the same scores with far less memory traffic. The cost is still real.
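A quick calculation makes the quadratic growth concrete. The function name is illustrative, and the count is for a single attention head in a single layer with no optimizations applied.

```python
def attention_score_entries(n_tokens: int) -> int:
    """Number of query-key scores in a full n x n attention matrix."""
    return n_tokens * n_tokens

for n in (1_000, 8_000, 128_000):
    print(f"{n:>7,} tokens -> {attention_score_entries(n):>18,} scores")
```

Doubling the input quadruples the number of scores, so a 128,000-token prompt needs 16,384 times as many as a 1,000-token prompt, not 128 times.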
"Multi-head" attention means doing this several times in parallel with different learned projections, then combining. Each head learns to focus on a different relationship (syntax, coreference, topic).
Visual Learning Suggestion
A heatmap diagram showing a sentence on both axes ("The cat sat on the mat because it was tired"). The cell where "it" meets "cat" lights up. This is the famous "attention head learns coreference" visual.
Interactive Element
In ChatGPT, write: "The trophy did not fit in the suitcase because it was too big. What does 'it' refer to?" Then: "The trophy did not fit in the suitcase because it was too small. What does 'it' refer to?" Different answers, with only one word changed. Attention is what lets that one word change which noun "it" connects to.
Hands-on Lab
No code yet. Write down 3 ambiguous-pronoun sentences. Predict the model's interpretation. Test in ChatGPT. Note where attention nailed it and where it slipped.
Mini Exercise
Why is attention n^2 in the sequence length? Explain in one sentence.
Common Mistakes
- Thinking attention is "the model paying attention to the user". It is internal, token-to-token.
- Believing attention solves long-context perfectly. It does not. Models still get lost in 200-page documents.
Debugging Tips
If a long-input prompt produces vague answers, the model may be losing attention on the middle of the input (the "lost in the middle" effect). Restructure: put the most important content at the start or end.
Knowledge Check Questions
- What are queries, keys, and values? What does each do?
- Why does attention scale quadratically?
- What is multi-head attention and why is it useful?
Quiz Questions
- Attention complexity in sequence length is: a) O(1) b) O(n) c) O(n log n) d) O(n^2) Answer: d
Challenge Task
Read the abstract of "Attention Is All You Need" (Vaswani et al., 2017) and write a 200-word explainer for a non-engineer.
Real-world Use Cases
- Long-document understanding (with the lost-in-the-middle caveat)
- Coreference resolution in chat
- Code understanding: "this variable refers to..."
Industry Insight
The n^2 cost shape is a big part of why long-context requests cost so much more. It is also why prompt compression is a real career: halving an input roughly quarters the attention work, so shrinking inputs intelligently can save a great deal of compute.
Interview Questions
- Explain self-attention to a non-technical PM in 60 seconds.
- Why is multi-head attention better than single-head?
- How would you mitigate the "lost in the middle" effect?
Summary
Attention lets each token look at every other token and decide what matters. It is the single most important architectural innovation of the modern AI era.
Lesson 2.4: Next-token prediction: ChatGPT is really autocomplete
Hook / Why This Matters
The whole magical thing is a glorified autocomplete. Once that lands, everything from hallucinations to creativity makes sense.
Beginner Analogy
You type "I am going to the gym to..." and your phone suggests "work out", "lift weights", "exercise". That is next-token prediction. ChatGPT does the same, but with dozens of stacked layers processing the context and a vocabulary that can reach roughly 200,000 tokens.
Concept Explanation
At every step, the model outputs a probability for each token in its vocabulary. The token with the highest probability is the "most likely next token". The system picks one (often not strictly the top, see "temperature") and appends it. Then it does it again. And again. Until a stop token.
A seemingly intelligent 800-token paragraph is really 800 sequential autocompletes.
Technical Breakdown
The model emits a vector of "logits" of length vocab_size. A softmax turns it into a probability distribution. Sampling strategies include:
- Greedy (temperature 0): always pick the top token. Deterministic, sometimes repetitive.
- Temperature sampling: divide logits by T and sample. T > 1 is wilder, T < 1 is more conservative.
- Top-k: only sample from the top k tokens.
- Top-p (nucleus): only sample from the smallest set of top tokens whose cumulative probability reaches p.
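The four strategies above can be sketched in plain Python. The logits are made-up scores for four candidate tokens, and the function names are illustrative. Real inference libraries implement the same ideas on tensors and may combine the filters in a different order.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize into probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy() for temperature 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Always pick the index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def _keep_only(probs, kept):
    """Zero out every token not in `kept` and renormalize the rest."""
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_k(probs, k):
    """Keep only the k most probable tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _keep_only(probs, set(ranked[:k]))

def top_p(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in ranked:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _keep_only(probs, kept)

def sample(probs, rng):
    """Draw one token index according to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical scores for four candidate tokens
rng = random.Random(0)
print(greedy(logits))                                        # 0
print([round(x, 3) for x in softmax(logits, temperature=0.5)])
print([round(x, 3) for x in softmax(logits, temperature=2.0)])
print(sample(top_p(softmax(logits), p=0.9), rng))
```

Comparing the two temperature lines shows the effect directly: at 0.5 the top token takes almost all the probability, and at 2.0 the distribution is much flatter.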
Visual Learning Suggestion
A bar chart of next-token probabilities after the prompt "The capital of France is". One huge bar for "Paris", small bars for "the", "Lyon", "Marseille". Adjust the chart for "The capital of France is famous for its..." and watch the distribution flatten.
Interactive Element
In the OpenAI Playground or Google AI Studio, run the same prompt twice with temperature 0 (identical answers) and twice with temperature 1 (different answers). Feel the difference.
Hands-on Lab
Write a one-line prompt about your hometown. Run it five times at temperature 1. Observe the variation. Now set temperature to 0 and run again. Note that the outputs are identical (modulo provider non-determinism).
Mini Exercise
If you set temperature to 0 and still see different responses across runs, what is one possible reason?
Common Mistakes
- Believing temperature 0 guarantees identical outputs across providers. It usually does not.
- Cranking temperature to "make the model creative" instead of using prompt design.
- Forgetting that the model is choosing tokens, not facts.
Debugging Tips
If you need exact, reproducible structured outputs, use temperature 0 plus structured output mode (JSON schema). If you need creative variation, temperature 0.7 to 1.0 is the sweet spot.
Knowledge Check Questions
- What does temperature control?
- What is the difference between top-k and top-p sampling?
- Why does the model sometimes "hallucinate" plausible-sounding facts?
Quiz Questions
- Hallucinations occur because: a) The model is broken b) The model is computing token probabilities, not retrieving facts c) The temperature is too low d) The prompt is too short Answer: b
Challenge Task
Build a one-shot "creative slogan generator" prompt. Optimize for variety with sampling settings rather than longer prompts.
Real-world Use Cases
- Streaming chat: each token is sent to the UI as it is sampled
- Cost control: stop tokens save money on long outputs
- Reliable JSON: temperature 0 plus JSON mode
Industry Insight
The single most common 2026 production gotcha: a developer assumes the model "knows" something, ships it, then hallucinations crash the demo. The model never "knows". It samples. Build accordingly.
Interview Questions
- What is next-token prediction?
- Compare greedy, top-k, top-p, and temperature sampling.
- Why are LLMs prone to hallucination?
Summary
ChatGPT is autocomplete running on a deep stack of Transformer layers. Knowing that everything is just sampling from a probability distribution makes hallucinations, creativity, and reliability all click into place.
Lesson 2.5: Pretraining vs fine-tuning vs RLHF, in plain English
Hook / Why This Matters
Every model you will use went through three life stages. Knowing them tells you what it is good at and where it will fail.
Beginner Analogy
Stage 1: read every book in the library. Stage 2: take a specialized course. Stage 3: get coached by mentors on how to behave. That is pretraining, fine-tuning, RLHF.
Concept Explanation
- Pretraining: the model is fed trillions of tokens of text and learns to predict the next token. This gives it language fluency, world knowledge, and general capabilities. Costs millions of dollars.
- Fine-tuning (Supervised): the model is shown thousands of pairs of (instruction, ideal response). This teaches it to follow instructions in the desired format.
- RLHF (Reinforcement Learning from Human Feedback): humans rank multiple model responses. A reward model learns the rankings. The model is updated to produce higher-ranked answers. This teaches helpfulness, harmlessness, and honesty.
Modern frontier models also use RLAIF (AI feedback in place of humans) and constitutional AI to scale this stage.
Technical Breakdown
You will rarely pretrain a model yourself (cost is prohibitive). You may fine-tune one if you have 100 to 10,000 high-quality examples. You almost never run RLHF yourself for chat. Most production teams stop at fine-tuning, often using LoRA (low-rank adapters) to keep costs low.
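A rough parameter count shows why LoRA keeps costs low. LoRA freezes the original weight matrix and trains two thin matrices whose product stands in for the update. The sizes below are assumptions for illustration: a 4,096 x 4,096 weight matrix (borrowing the hidden size mentioned in Lesson 2.2) and a rank of 8.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable values when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in

def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable values for two low-rank matrices: (d_out x rank) and (rank x d_in)."""
    return rank * (d_out + d_in)

full = full_update_params(4096, 4096)
lora = lora_params(4096, 4096, rank=8)
print(f"full fine-tune: {full:,} values per matrix")  # 16,777,216
print(f"LoRA (rank 8):  {lora:,} values per matrix")  # 65,536
print(f"reduction:      {full // lora}x")             # 256x
```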
Visual Learning Suggestion
A three-stage horizontal pipeline labeled "Pretrain (years of internet text) -> Fine-tune (instruction pairs) -> RLHF (human preferences)" with an output box labeled "ChatGPT-style assistant".
Interactive Element
Ask GPT-4 or Claude: "What was the last time you were updated?" Compare to: "Search the web for today's date." The first is limited by the training cutoff. The second uses tools. The gap between training time and deployment time matters.
Hands-on Lab
Look up the "knowledge cutoff" of three different LLMs (Google a phrase like "GPT-4 knowledge cutoff", "Claude knowledge cutoff", "Gemini knowledge cutoff"). Note them. Note that this is why models do not know "current" news without tool use.
Mini Exercise
Why does RLHF make models more agreeable but sometimes less accurate?
Common Mistakes
- Trying to fine-tune to "teach the model facts". Fine-tuning mostly teaches style and format; it is an unreliable way to add durable facts.
- Confusing fine-tuning with RAG. RAG retrieves at inference time. Fine-tuning bakes in behavior.
- Believing a model knows things "as of today". It knows up to its training cutoff.
Debugging Tips
If your model gives stale answers, you need RAG or tools, not fine-tuning. If your model ignores format instructions, fine-tuning may help.
Knowledge Check Questions
- Define pretraining, fine-tuning, and RLHF.
- When would you fine-tune vs use RAG?
- What is a knowledge cutoff?
Quiz Questions
- To make the model answer in a specific JSON format every time, your first move should be: a) Pretrain from scratch b) Use a system prompt and JSON mode c) Run RLHF d) Buy more GPUs Answer: b
Challenge Task
Sketch a decision tree: "Should I prompt, fine-tune, or RAG this problem?". Three branches with one example each.
Real-world Use Cases
- Pretraining: not your job
- Fine-tuning: domain-specific writing styles, structured outputs, low-resource languages
- RLHF: behavior tuning at the model provider level
Industry Insight
In 2026, prompt + RAG solves 80% of business problems. Fine-tuning solves another 15%. Pretraining is left to the labs. This ratio shapes your career: master prompting and RAG first.
Interview Questions
- Walk me through the three training stages of a modern chat LLM.
- When would you choose fine-tuning over RAG?
- What is LoRA and why does it matter?
Summary
Pretraining builds the brain. Fine-tuning shapes the personality. RLHF gives it manners. As a developer, you mostly work above all three with prompts, RAG, and the occasional fine-tune.
Module 2 Recap
You can now narrate, from memory, the path of a prompt through ChatGPT: tokenize, embed, attend, layer, predict, sample, loop. You know why hallucinations happen, why temperature exists, and how a model goes from pretraining to your screen.
SEO Notes
- Primary keyword: "how ChatGPT works"
- Featured snippet target: the "6-step journey" list in Lesson 2.1
- Internal links: Modules 1, 3, and 4. Link to "AI tools" hub.
- AISO formatting: definition blocks at the top of each lesson, comparison tables, FAQs at module end.
Next Module
Module 3: Tokens, Prompts, Context Windows, and AI Conversations