Module 3: Tokens, Prompts, Context Windows, and AI Conversations
Module Goal
Become fluent in the four units every AI engineer thinks in daily: tokens, prompts, context windows, and conversation turns. By the end, you will be able to predict cost, debug truncation, and design multi-turn flows that do not blow up.
Estimated Duration
3 to 4 hours.
Skills Learned
- Counting tokens for any input or output
- Distinguishing system, user, and assistant messages
- Designing prompts that fit comfortably in a context window
- Managing multi-turn conversation memory
- Estimating API cost before you spend it
Real-world Importance
In production, the difference between a feature that scales and one that bankrupts you is almost always token discipline. Engineers who track tokens ship reliable apps. Those who do not get bill-shocked.
Lessons in this module
- Tokens, in depth: how to count and why it matters
- The anatomy of a prompt: system, user, assistant
- Context windows: what fits, what gets cut, and the lost-in-the-middle effect
- Multi-turn conversations: how memory really works
- Token math and cost estimation
Lesson 3.1: Tokens, in depth: how to count and why it matters
Hook / Why This Matters
Tokens are the currency of LLMs. You pay per token, you are limited per token, you are throttled per token. Learn to think in tokens or pay extra forever.
Beginner Analogy
Tokens are like Uber's per-kilometer pricing. Words are the rider's mental model of distance. The driver only cares about the kilometers on the meter. You are the driver now.
Concept Explanation
A token is a small chunk of text the model treats as one unit. English averages about 1.3 tokens per word. Hindi, Arabic, Tamil, and many other non-Latin scripts often run 2 to 4 tokens per word. Code often uses more tokens per character than English prose because punctuation, indentation, and unusual identifiers tend to split into many small tokens.
Input tokens (your prompt) and output tokens (the response) are usually priced differently. Output is more expensive almost everywhere.
Technical Breakdown
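The tokenizer is provider-specific. OpenAI's GPT-4o family uses o200k_base (about 200K vocabulary), while the older GPT-4 and GPT-3.5 models use cl100k_base. Anthropic's Claude uses its own. Google's Gemini uses SentencePiece. Token counts will differ across providers, and even across model generations from one provider, for the same text.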
In Python, count tokens with:
import tiktoken
enc = tiktoken.get_encoding("o200k_base")
text = "Hello, world!"
tokens = enc.encode(text)
print(len(tokens), tokens)
Visual Learning Suggestion
A 3-row table:
- "ChatGPT is amazing" -> 3 tokens
- "Pneumonoultramicroscopicsilicovolcanoconiosis" -> ~10 tokens
- def hello(): print("hi") -> ~6 tokens
Interactive Element
Tokenize your own name, a famous quote, a JSON snippet, and a paragraph in your native language using https://platform.openai.com/tokenizer. Save the four counts.
Hands-on Lab
Install Python and run:
pip install tiktoken
Then run a small script that takes a file and prints the token count. This is the building block of every cost estimator you will build.
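A minimal sketch of such a script, assuming the o200k_base encoding used in the snippet above. The helper names make_counter and count_file_tokens are illustrative, and the fallback of roughly four characters per token is a crude, English-centric assumption that only kicks in when tiktoken cannot be loaded.

```python
from pathlib import Path


def make_counter(encoding_name="o200k_base"):
    """Return a function mapping text -> token count."""
    try:
        import tiktoken  # pip install tiktoken

        enc = tiktoken.get_encoding(encoding_name)
        return lambda text: len(enc.encode(text))
    except Exception:
        # Assumption: ~4 characters per token, a rough estimate for English only.
        return lambda text: len(text) // 4 + 1 if text else 0


def count_file_tokens(path, counter):
    """Read a UTF-8 text file and return its token count."""
    return counter(Path(path).read_text(encoding="utf-8"))


counter = make_counter()
print(counter("Hello, world!"))
```

To count a real file, call count_file_tokens("notes.txt", counter) with a path of your own. Passing the counter in as an argument makes it easy to swap in another provider's tokenizer later.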
Mini Exercise
If output tokens cost 3x input tokens and your typical request is 200 input + 800 output, what fraction of your bill is output?
Common Mistakes
- Assuming every provider gives the same count
- Forgetting that whitespace, newlines, and emojis count as tokens
- Counting words instead of tokens when planning a context budget
Debugging Tips
When you see "context length exceeded", look at input + system prompt + history. The output reserve also eats into your limit.
Knowledge Check Questions
- Why are token counts provider-specific?
- What is the typical ratio of tokens to English words?
- Why is output usually pricier than input?
Quiz Questions
- Which is most likely to use the most tokens?
a) "Hello world" in English
b) "Hello world" in Hindi
c) "Hello world" as print("hello world")
d) All equal
Answer: b
Challenge Task
Build a CLI that takes a file and prints (a) token count, (b) estimated input cost, (c) estimated total cost for a chat with 1.5x output ratio at GPT-4o pricing.
Real-world Use Cases
- Pre-flight token budget check before calling expensive long-context models
- Internal cost dashboards
- Multilingual feature pricing decisions
Industry Insight
The fastest cost savings in 2026 production usually come from tokenizer-aware prompt shaping: dropping unnecessary boilerplate, choosing models with better tokenizers for your target language, and using small models for high-frequency calls.
Interview Questions
- How do you count tokens for a request before sending it?
- How would you reduce token usage for a Hindi-language chatbot?
- What is the difference in token count between equivalent text and JSON?
Summary
Tokens are billing units. Count them, budget them, optimize them. Every senior AI engineer thinks in tokens before they think in words.
Lesson 3.2: The anatomy of a prompt: system, user, assistant
Hook / Why This Matters
If you ever wondered why a chatbot "remembers who it is" or "stays in character", the answer is the system message. Mastering message roles is what separates hobbyists from engineers.
Beginner Analogy
A play has three speakers: the director (system), the audience member who asks (user), the actor who responds (assistant). The director's notes set the stage. The audience asks. The actor delivers, then waits for the next prompt.
Concept Explanation
Modern chat APIs use a list of messages, each with a role:
- system: instructions, persona, constraints, format rules. Persistent and high priority.
- user: the human's request.
- assistant: the model's previous responses, sent back so the model has memory of what it already said.
- tool (newer): structured outputs from tool calls.
The API takes the entire array each call. There is no hidden server-side memory unless you build it. You manage memory.
Technical Breakdown
A typical OpenAI call:
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a concise tutor. Answer in 2 sentences."},
        {"role": "user", "content": "Explain RAG."},
    ],
)
print(response.choices[0].message.content)
Add the assistant reply back into messages for the next turn. That is conversation memory.
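A minimal sketch of that append step. To keep it runnable offline, fake_model is a hypothetical stand-in for the client.chat.completions.create call above; in a real app you would append response.choices[0].message.content instead. The chat_turn helper is also an illustrative name, not part of any SDK.

```python
def fake_model(messages):
    """Hypothetical stand-in for the API call: reports how much history it received."""
    return f"(reply after seeing {len(messages)} messages)"


def chat_turn(messages, user_text, model=fake_model):
    """Append the user message, call the model, then append its reply."""
    messages.append({"role": "user", "content": user_text})
    reply = model(messages)
    messages.append({"role": "assistant", "content": reply})
    return reply


messages = [{"role": "system", "content": "You are a concise tutor."}]
chat_turn(messages, "Explain RAG.")
chat_turn(messages, "Give one example.")

for m in messages:
    print(f"{m['role']:>9}: {m['content']}")
```

The second call sees four messages because the first user message and the first reply are both still in the list. Skip the second append and the model has no record of what it said.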
Visual Learning Suggestion
A vertical message list visualization: a settings-icon next to "system", a person-icon next to "user", a sparkles-icon next to "assistant". Show how the array grows turn by turn.
Interactive Element
Open Google AI Studio or OpenAI Playground. Add a system message that says "Always reply in haiku." Send any prompt. Notice the constraint persists across turns.
Hands-on Lab
Write a 3-turn conversation by hand as a JSON array of messages. Include one system, three user, and three assistant entries (seven messages in total). This is the "wire format" of every chat app.
Mini Exercise
What happens if you put a system instruction inside a user message instead? Is it weaker, stronger, or the same?
Common Mistakes
- Forgetting to append assistant replies to history (model "forgets" each turn)
- Putting persona inside every user message instead of once in system
- Building "memory" via repeated system updates instead of clean message arrays
Debugging Tips
When the model loses persona, check that your system message is still in the array. When it loops, check you are not double-appending.
Knowledge Check Questions
- What are the three core message roles?
- What does the assistant role represent?
- How does the model know what it said earlier?
Quiz Questions
- To make a chatbot remember what it said two turns ago, you must: a) Set a flag on the API b) Re-include past assistant and user messages in the array c) Use a different model d) Use a vector database Answer: b
Challenge Task
Build a small script that maintains a chat in a Python list and prints history nicely. No frameworks. Just messages.append(...).
Real-world Use Cases
- Chatbots with personas (customer support, tutor, sales)
- Multi-step agents that hand off to one another
- Tool-using assistants that emit tool messages
Industry Insight
In production, you do not keep growing the messages array forever. You truncate the history, summarize it, or keep a rolling window of recent turns. Module 6 will show you how.
Interview Questions
- What is the difference between system, user, and assistant messages?
- How would you maintain memory across 50 turns without blowing the context window?
- Why is the system message often more "obeyed" than instructions inside user content?
Summary
A prompt is an ordered list of role-tagged messages, not a single string. System sets the stage, user asks, assistant remembers. You manage the array and resend it on every call, because the API keeps nothing between calls.
Lesson 3.3: Context windows: what fits, what gets cut, and the lost-in-the-middle effect
Hook / Why This Matters
A 1M-token context window sounds infinite. It is not. And paying to send 800K tokens you do not need is a real, expensive 2026 mistake.
Beginner Analogy
Your short-term memory holds about seven things at once. The model's context window is much bigger, but still finite. And just like you, things in the middle of a long list get forgotten first.
Concept Explanation
The context window is the maximum number of tokens (input + output) the model can consider in one request. Examples in 2026:
- GPT-4o: 128K
- Claude 4.x family: 200K
- Gemini 2.x: 1M (some variants)
When your input exceeds the window, you must truncate, summarize, or use RAG. When the input is huge but fits, expect:
- Higher cost and latency.
- The "lost in the middle" effect: facts placed in the middle of a very long input are recalled worse than facts at the start or end.
Technical Breakdown
The window includes system + all messages + the response budget you reserve via max_tokens. If you have a 128K window and set max_tokens=8000, your input ceiling is effectively 120K.
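The budget arithmetic above can be sketched as a few helper functions. The function names and the 80% warning threshold are illustrative assumptions, and the window sizes are passed in rather than looked up, since they change between models.

```python
def input_ceiling(context_window, max_output_tokens):
    """Tokens left for system + history + user input after reserving the output."""
    return context_window - max_output_tokens


def remaining_for_user(context_window, max_output_tokens, used_tokens):
    """Tokens still available for the next user message (never negative)."""
    return max(0, input_ceiling(context_window, max_output_tokens) - used_tokens)


def over_threshold(token_count, context_window, threshold=0.8):
    """True when an input uses more than `threshold` of the window."""
    return token_count > threshold * context_window


print(input_ceiling(128_000, 8_000))               # 120000
print(remaining_for_user(128_000, 8_000, 50_000))  # 70000
print(over_threshold(110_000, 128_000))            # True
```

Running a check like this before every call turns a "context length exceeded" error into a warning you can act on.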
Some providers offer "prompt caching" that lowers cost for repeated long prefixes. This is essential when you reuse the same big system prompt for many users.
Visual Learning Suggestion
A long horizontal bar representing the context window, divided into "system", "history", "current user", "response budget". Color the "response budget" differently. Overlay a U-shape "recall accuracy" curve to illustrate lost-in-the-middle.
Interactive Element
Paste a long document (5-10K words) into ChatGPT or Claude. Put a unique factoid at the start, middle, and end. Ask three questions, one targeting each. Note which the model nails and which it misses.
Hands-on Lab
Build a tiny script that takes a file, counts tokens, and warns if it exceeds 80% of the chosen model's window.
Mini Exercise
You have a 200K window. Your system + history is 80K tokens. You want to reserve 8K for the response. What is the max user input token count?
Common Mistakes
- Thinking "max_tokens" is the input limit. It is the output reserve.
- Stuffing every PDF into one prompt instead of using RAG
- Ignoring the lost-in-the-middle effect when designing long prompts
Debugging Tips
If recall is poor on a long input, restructure: put the most important content near the top and bottom, with summaries in the middle.
Knowledge Check Questions
- What does the context window include?
- What is the lost-in-the-middle effect?
- Why is sending huge inputs not a free upgrade just because the window is big?
Quiz Questions
- To improve recall on a long document, you should: a) Set temperature to 0 b) Place key facts near the start and end c) Use a smaller model d) Disable streaming Answer: b
Challenge Task
Run a "needle in a haystack" test on any model. Generate a 50K-token text, hide a single unique sentence ("the secret code is 9j2k") at three depths, and measure recall.
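A minimal sketch of the haystack-building half of that test, under simplifying assumptions: the filler is one repeated sentence (real tests use varied text), depth is a fraction between 0 and 1, and scoring is a plain substring check. The names build_haystack and recalled are illustrative, and the model call itself is left to you.

```python
def build_haystack(filler_sentence, n_sentences, needle, depth):
    """Return filler text with `needle` inserted at a relative depth in [0, 1]."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    sentences = [filler_sentence] * n_sentences
    position = round(depth * n_sentences)
    sentences.insert(position, needle)
    return " ".join(sentences)


def recalled(answer, secret):
    """Crude scoring: did the model's answer contain the secret string?"""
    return secret.lower() in answer.lower()


NEEDLE = "The secret code is 9j2k."
for depth in (0.0, 0.5, 1.0):
    prompt = build_haystack("The sky was grey that day.", 1000, NEEDLE, depth)
    # Send `prompt` plus the question "What is the secret code?" to your model,
    # then score the reply with recalled(reply, "9j2k").
    print(depth, prompt.index(NEEDLE))
```

Raise n_sentences until a token counter reports roughly the size you want to test, and repeat each depth several times, since a single run per depth is noisy.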
Real-world Use Cases
- Document QA tools
- Long-form coding assistants
- Codebase-aware refactoring tools
Industry Insight
2026 prompt design has moved from "stuff everything in" to "select the right 10K tokens via retrieval and put them where the model will see them best". That is RAG plus prompt engineering.
Interview Questions
- What is a context window and how do you budget it?
- Explain the lost-in-the-middle effect and how to mitigate it.
- Why does 1M context not eliminate the need for RAG?
Summary
Context windows are big, finite, and lossy in the middle. Engineer prompts to fit, place critical content well, and prefer retrieval over stuffing.
Lesson 3.4: Multi-turn conversations: how memory really works
Hook / Why This Matters
A chatbot that "remembers" is doing one of three things: re-sending the whole history, summarizing it, or retrieving from a memory store. Knowing which avoids the most common production bugs.
Beginner Analogy
Imagine a meeting where every five minutes a new attendee joins. To keep up, you either replay the whole tape, hand them a summary, or look up what they need from a shared notes app. LLMs have the same three options.
Concept Explanation
Four memory strategies:
- Full history (replay): include every prior message. Highest fidelity, highest cost, blows up beyond a few thousand turns.
- Rolling window: keep the last N messages. Cheap, loses old facts.
- Summarized memory: after every K turns, summarize older history into a single message. Cheap, decent recall.
- Retrieval-based memory: store messages in a vector DB, retrieve the most relevant ones at each turn. Scales to very long histories, but is more complex to build.
Most production chatbots use a hybrid: rolling window for recent turns + summary for older + retrieval for important facts.
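The rolling-window strategy can be sketched in a few lines. This version makes one design assumption: system messages are always kept and only user and assistant messages are trimmed. The function name rolling_window is illustrative.

```python
def rolling_window(messages, keep_last=10):
    """Keep system messages plus the last `keep_last` non-system messages.

    Returns (trimmed_messages, number_dropped).
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    dropped = max(0, len(rest) - keep_last)
    return system + rest[dropped:], dropped


history = [{"role": "system", "content": "Be brief."}]
for i in range(8):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed, dropped = rolling_window(history, keep_last=10)
print(len(trimmed), dropped)  # 11 6
```

Returning the number of dropped messages lets the caller log or warn when old context falls out of the window.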
Technical Breakdown
For a Streamlit demo, full history fits easily. For an app with thousands of turns, you need summarization. Here is the summarization pattern:
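if len(messages) > 20:
    # Assumes messages[0] is the original system prompt; keep it out of the summary.
    system, rest = messages[0], messages[1:]
    older, recent = rest[:-10], rest[-10:]
    summary = summarize(older)  # your own helper: one LLM call that compresses older turns
    messages = [system, {"role": "system", "content": f"Earlier summary: {summary}"}] + recent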
Visual Learning Suggestion
Four small diagrams side by side, one per strategy, with arrows showing message flow and a token-cost label.
Interactive Element
Have a 10-turn conversation with ChatGPT where in turn 1 you say "my favorite number is 42". By turn 10 ask "what is my favorite number". Note recall. Then start a new chat (no memory). Note loss.
Hands-on Lab
Extend your Lesson 3.2 chat script to use a rolling window of 10 messages, and warn when older messages get dropped.
Mini Exercise
Why is "summarized memory" not perfect? Where does it fail?
Common Mistakes
- Forgetting to persist memory across user sessions (memory dies on refresh)
- Mixing memory of multiple users when scaling
- Trusting the model to "remember" without sending the history
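The user-mixing mistake above usually comes from one shared message list. A minimal sketch of per-user isolation, assuming an in-memory dict keyed by a user id (the ids, the remember helper, and the system text are hypothetical):

```python
from collections import defaultdict

SYSTEM = {"role": "system", "content": "You are a support agent."}

# One message list per user id; each new user starts with a fresh copy of the system message.
histories = defaultdict(lambda: [dict(SYSTEM)])


def remember(user_id, role, content):
    """Append a message to that user's history only."""
    histories[user_id].append({"role": role, "content": content})


remember("alice", "user", "My order number is 123.")
remember("bob", "user", "I need a refund.")
print(len(histories["alice"]), len(histories["bob"]))  # 2 2
```

This dict lives in process memory, so it still dies on restart. Persisting across sessions means writing each user's list to a database or cache under the same key.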
Debugging Tips
If users complain "it forgot what I said", check your memory strategy. Almost always one of: missing assistant append, rolling window too small, no summarization.
Knowledge Check Questions
- Name three memory strategies.
- When does a rolling window break down?
- Why is retrieval-based memory needed for long-lived agents?
Quiz Questions
- For a customer support bot that runs for months per user, the best memory strategy is: a) Full history b) Rolling window c) Retrieval-based d) None, restart each session Answer: c
Challenge Task
Implement summarized memory: after every 10 messages, call the model to compress older turns into a single 200-token summary.
Real-world Use Cases
- Customer support chatbots
- AI tutors that remember your goals across sessions
- Coding copilots that recall your project conventions
Industry Insight
In 2026 the hottest memory pattern is "structured memory": extract facts (preferences, goals, identities) into a small JSON store and re-inject them as system text, instead of replaying raw turns.
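A minimal sketch of the re-injection half of structured memory. It assumes the facts have already been extracted into a dict (in practice usually by a separate LLM call, not shown here); the function names and the example facts are hypothetical.

```python
import json


def memory_to_system_text(facts):
    """Render stored facts as a compact system message."""
    return {
        "role": "system",
        "content": "Known facts about the user: " + json.dumps(facts, sort_keys=True),
    }


def build_messages(base_system, facts, recent_turns):
    """Base system prompt + injected facts + only the recent raw turns."""
    return [base_system, memory_to_system_text(facts)] + list(recent_turns)


facts = {"name": "Asha", "goal": "learn RAG", "language": "Hindi"}
base = {"role": "system", "content": "You are a patient tutor."}
recent = [{"role": "user", "content": "What should I study next?"}]

for m in build_messages(base, facts, recent):
    print(m["role"], "->", m["content"])
```

The facts message stays roughly the same size no matter how long the relationship with the user lasts, which is what makes this cheaper than replaying raw turns.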
Interview Questions
- How does a chatbot "remember"?
- Compare rolling window vs summarization vs retrieval memory.
- How do you isolate memory between users?
Summary
LLMs are stateless. Memory is an application concern. Pick a strategy (replay, window, summary, retrieval) and engineer it deliberately.
Lesson 3.5: Token math and cost estimation
Hook / Why This Matters
The single most important spreadsheet of your AI career is the cost-per-feature estimator. This lesson gives you the formula.
Beginner Analogy
Calculating LLM cost is like calculating a phone bill. Per-minute price times minutes used, plus optional extras. Easy. People skip it and get shocked anyway.
Concept Explanation
Cost per request = (input_tokens / 1,000,000 * input_price) + (output_tokens / 1,000,000 * output_price).
Cost per feature per month = cost per request * requests per user per month * users.
Examples (illustrative 2026 prices):
| Model | Input ($/M tok) | Output ($/M tok) |
|---|---|---|
| GPT-4o mini | 0.15 | 0.60 |
| GPT-4o | 2.50 | 10.00 |
| Claude Haiku-class | 0.25 | 1.25 |
| Claude Sonnet-class | 3.00 | 15.00 |
| Gemini Flash | 0.075 | 0.30 |
(Check live prices before committing. These change quarterly.)
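The per-request formula above translates directly into code. This sketch copies two rows of the illustrative table; the dictionary keys are labels chosen for this example, and the prices should be replaced with live ones before you rely on the output.

```python
# Illustrative prices copied from the table above, in dollars per million tokens.
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
}


def cost_per_request(model, input_tokens, output_tokens, prices=PRICES):
    """(input_tokens / 1M * input_price) + (output_tokens / 1M * output_price)."""
    input_price, output_price = prices[model]
    return (input_tokens / 1_000_000 * input_price
            + output_tokens / 1_000_000 * output_price)


print(f"${cost_per_request('gpt-4o', 1_000, 500):.4f}")  # $0.0075
```

In that example the 500 output tokens cost twice as much as the 1,000 input tokens, which is why output length is usually the first thing to cap.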
Technical Breakdown
Build the estimator in a spreadsheet with columns: model, input price, output price, avg input tokens, avg output tokens, requests/user/month, users, cost/user/month, total/month. Add a "buffer 30%" column for spikes. This is your monthly burn projection.
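The same spreadsheet columns can be sketched as one function, which is handy for checking the spreadsheet's formulas. The function name and the example workload are assumptions for illustration; the prices come from the Sonnet-class row of the illustrative table.

```python
def monthly_cost(input_price, output_price, avg_input_tokens, avg_output_tokens,
                 requests_per_user_per_month, users, buffer=0.30):
    """Return (cost_per_user, total, total_with_buffer) in dollars per month."""
    per_request = (avg_input_tokens / 1_000_000 * input_price
                   + avg_output_tokens / 1_000_000 * output_price)
    per_user = per_request * requests_per_user_per_month
    total = per_user * users
    return per_user, total, total * (1 + buffer)


per_user, total, buffered = monthly_cost(
    input_price=3.00, output_price=15.00,  # illustrative Sonnet-class prices
    avg_input_tokens=2_000, avg_output_tokens=500,
    requests_per_user_per_month=100, users=500,
)
print(f"per user: ${per_user:.2f}, total: ${total:.2f}, with buffer: ${buffered:.2f}")
```

For this hypothetical workload the projection is $1.35 per user and $675 per month, or $877.50 with the 30% buffer.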
Visual Learning Suggestion
A stacked bar chart: per-feature monthly cost broken down by model. Useful for showing in a planning meeting.
Interactive Element
Estimate the cost of a chatbot used 10 times per day by 1,000 users, average 300 input + 500 output tokens per request, on GPT-4o-mini. Answer at end.
Hands-on Lab
Build the estimator spreadsheet from scratch. Save it. You will use it for every project in this course.
Mini Exercise
If switching from GPT-4o to GPT-4o-mini cuts cost by 90% but lowers quality by 10%, when is the switch worth it?
Common Mistakes
- Forgetting to include the system prompt token count (it ships every request)
- Forgetting that with full replay, input tokens grow linearly with each turn, so the cumulative cost of a conversation grows roughly quadratically
- Ignoring failed requests (retries cost too)
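To see how conversation history inflates input tokens, here is a small simulation of full-history replay. It assumes fixed message sizes per turn, which real chats do not have; the function name and the token counts are illustrative.

```python
def replay_input_tokens(turns, system_tokens, user_tokens, assistant_tokens):
    """Input tokens billed on each turn when the full history is resent."""
    per_turn = []
    history = system_tokens
    for _ in range(turns):
        history += user_tokens       # the new user message joins the input
        per_turn.append(history)     # everything so far is billed as input
        history += assistant_tokens  # the reply is replayed on later turns
    return per_turn


per_turn = replay_input_tokens(turns=10, system_tokens=200, user_tokens=50, assistant_tokens=150)
print(per_turn[0], per_turn[-1], sum(per_turn))  # 250 2050 11500
```

Turn 10 alone sends about eight times the input of turn 1, and the ten turns together bill 11,500 input tokens even though only 500 tokens of user text were ever typed.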
Debugging Tips
If cost dashboards spike, check three things: prompt size grew, output is longer than expected, traffic spiked. Almost always one of those.
Knowledge Check Questions
- How do you compute cost per request?
- What two costs grow with conversation length?
- When is small-model "good enough" the right call?
Quiz Questions
- To halve API spend with least quality loss, your first move is usually: a) Use a more expensive model b) Switch to a smaller, capable model for non-critical paths c) Disable streaming d) Increase max_tokens Answer: b
Challenge Task
Pick a hypothetical product and produce a one-page cost memo with three scenarios (low, expected, high) and a recommendation.
Real-world Use Cases
- Pricing your SaaS feature
- Choosing between providers
- Justifying a model switch to your team
Industry Insight
The 2026 job title "AI cost engineer" did not exist in 2023. It now does. Anyone fluent in token math can become the highest-impact engineer on the team within a quarter.
Interview Questions
- Walk me through how you would estimate the cost of a chatbot feature.
- What is the typical input/output cost ratio?
- How do you control costs in production?
Summary
Cost in LLM apps is predictable if you count tokens. Build the spreadsheet, monitor it, and the bill rarely surprises you.
Interactive Element answer: per request = (300/1M * 0.15) + (500/1M * 0.60) = $0.000045 + $0.0003 = ~$0.000345. Per user/month = 10 * 30 * $0.000345 = ~$0.10. 1000 users = ~$100/month.
Module 3 Recap
You can count tokens, design message arrays, plan within a context window, manage multi-turn memory, and estimate cost. You are now equipped to talk like an AI engineer in any planning meeting.
SEO Notes
- Primary keyword: "tokens prompts context windows"
- Featured snippet target: the message role anatomy table in Lesson 3.2 and the pricing table in Lesson 3.5
- Internal links: Module 2 (foundations), Module 4 (next), Module 5 (APIs)