Module 3: Tokens, Prompts, Context Windows, and AI Conversations

Module Goal

Become fluent in the four units every AI engineer thinks in daily: tokens, prompts, context windows, and conversation turns. By the end you can predict cost, debug truncation, and design multi-turn flows that do not blow up.

Estimated Duration

3 to 4 hours.

Skills Learned

Counting tokens for any input or output
Distinguishing system, user, and assistant messages
Designing prompts that fit comfortably in a context window
Managing multi-turn conversation memory
Estimating API cost before you spend it

Real-world Importance

In production, the difference between a feature that scales and one that bankrupts you is almost always token discipline. Engineers who track tokens ship reliable apps. Those who do not get bill shock.

Lessons in this module

Tokens, in depth: how to count and why it matters
The anatomy of a prompt: system, user, assistant
Context windows: what fits, what gets cut, and the lost-in-the-middle effect
Multi-turn conversations: how memory really works
Token math and cost estimation

Lesson 3.1: Tokens, in depth: how to count and why it matters

Hook / Why This Matters

Tokens are the currency of LLMs. You pay per token, you are limited per token, you are throttled per token. Learn to think in tokens or pay extra forever.

Beginner Analogy

Tokens are like Uber's per-kilometer pricing. Words are the rider's mental model of distance. The driver only cares about meters. You are the driver now.

Concept Explanation

A token is a small chunk of text the model treats as one unit. English averages about 1.3 tokens per word. Hindi, Arabic, Tamil, and many other non-Latin scripts often run 2 to 4 tokens per word. Code often uses more tokens per character than English prose, because punctuation, operators, and indentation frequently become separate tokens.

Input tokens (your prompt) and output tokens (the response) are usually priced differently. Output is more expensive almost everywhere.

Technical Breakdown

The tokenizer is provider-specific. OpenAI's GPT-4o family uses o200k_base (about 200K vocabulary), while the older GPT-4 and GPT-3.5 Turbo models use cl100k_base. Anthropic's Claude uses its own. Google's Gemini uses SentencePiece. Token counts will differ across providers even for the same text.

In Python, count tokens with:

import tiktoken
enc = tiktoken.get_encoding("o200k_base")
text = "Hello, world!"
tokens = enc.encode(text)
print(len(tokens), tokens)

Visual Learning Suggestion

A 3-row table (counts are approximate and depend on the tokenizer):

"ChatGPT is amazing" -> 3 tokens
"Pneumonoultramicroscopicsilicovolcanoconiosis" -> ~10 tokens
def hello(): print("hi") -> ~6 tokens

Interactive Element

Tokenize your own name, a famous quote, a JSON snippet, and a paragraph in your native language using https://platform.openai.com/tokenizer. Save the four counts.

Hands-on Lab

Install Python and run:

pip install tiktoken

Then run a small script that takes a file and prints the token count. This is the building block of every cost estimator you will build.
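A minimal sketch of such a script follows. It assumes the o200k_base encoding used earlier in this lesson. The function names count_tokens and count_file_tokens are illustrative, and the fallback of about 4 characters per token is only a rough rule of thumb for English text, used when tiktoken is not installed or cannot load its encoding.

```python
import math
from pathlib import Path


def count_tokens(text: str, encoding_name: str = "o200k_base") -> int:
    """Count tokens with tiktoken, or fall back to a rough estimate."""
    try:
        import tiktoken  # optional dependency: pip install tiktoken

        return len(tiktoken.get_encoding(encoding_name).encode(text))
    except Exception:
        # Assumption: roughly 4 characters per token for English text.
        # This is only a ballpark and is wrong for most other languages.
        return math.ceil(len(text) / 4)


def count_file_tokens(path: str) -> int:
    """Read a UTF-8 text file and return its token count."""
    return count_tokens(Path(path).read_text(encoding="utf-8"))


print(count_tokens("Hello, world!"))
```

Call count_file_tokens("notes.txt") on any text file you have. Swapping the encoding name lets you compare tokenizers on the same file.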

Mini Exercise

If output tokens cost 3x input tokens and your typical request is 200 input + 800 output, what fraction of your bill is output?

Common Mistakes

Assuming every provider gives the same count
Forgetting that whitespace, newlines, and emojis count as tokens
Counting words instead of tokens when planning a context budget

Debugging Tips

When you see "context length exceeded", look at input + system prompt + history. The output reserve also eats into your limit.

Knowledge Check Questions

Why are token counts provider-specific?
What is the typical ratio of tokens to English words?
Why is output usually pricier than input?

Quiz Questions

Which is most likely to use the most tokens?
a) "Hello world" in English
b) "Hello world" in Hindi
c) "Hello world" as print("hello world")
d) All equal
Answer: b

Challenge Task

Build a CLI that takes a file and prints (a) token count, (b) estimated input cost, (c) estimated total cost for a chat with 1.5x output ratio at GPT-4o pricing.

Real-world Use Cases

Pre-flight token budget check before calling expensive long-context models
Internal cost dashboards
Multilingual feature pricing decisions

Industry Insight

The fastest cost savings in 2026 production usually come from tokenizer-aware prompt shaping: dropping unnecessary boilerplate, choosing models with better tokenizers for your target language, and using small models for high-frequency calls.

Interview Questions

How do you count tokens for a request before sending it?
How would you reduce token usage for a Hindi-language chatbot?
What is the difference in token count between equivalent text and JSON?

Summary

Tokens are billing units. Count them, budget them, optimize them. Every senior AI engineer thinks in tokens before they think in words.

Lesson 3.2: The anatomy of a prompt: system, user, assistant

Hook / Why This Matters

If you ever wondered why ChatGPT "remembers it is ChatGPT" or "stays in character", it is the system message. Mastering message roles is what separates hobbyists from engineers.

Beginner Analogy

A play has three speakers: the director (system), the audience member who asks (user), the actor who responds (assistant). The director's notes set the stage. The audience asks. The actor delivers, then waits for the next prompt.

Concept Explanation

Modern chat APIs use a list of messages, each with a role:

system: instructions, persona, constraints, format rules. Persistent and high priority.
user: the human's request.
assistant: the model's previous responses, sent back so the model has memory of what it already said.
tool (newer): results returned from tool calls, sent back so the model can use them.

The API takes the entire array each call. There is no hidden server-side memory unless you build it. You manage memory.

Technical Breakdown

A typical OpenAI call:

from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a concise tutor. Answer in 2 sentences."},
        {"role": "user", "content": "Explain RAG."},
    ],
)
print(response.choices[0].message.content)

Add the assistant reply back into messages for the next turn. That is conversation memory.
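The append-and-resend loop can be sketched as below. To keep the example runnable offline, the model call is replaced by fake_complete, a hypothetical stand-in. The helper name send_turn is also illustrative and is not part of any SDK.

```python
from typing import Callable

Message = dict[str, str]


def send_turn(
    messages: list[Message],
    user_text: str,
    complete: Callable[[list[Message]], str],
) -> str:
    """Append the user message, call the model, append its reply."""
    messages.append({"role": "user", "content": user_text})
    reply = complete(messages)  # the model sees the full history every call
    messages.append({"role": "assistant", "content": reply})
    return reply


def fake_complete(messages: list[Message]) -> str:
    """Hypothetical stand-in for a real API call, so the sketch runs offline."""
    user_turns = sum(1 for m in messages if m["role"] == "user")
    return f"(reply to user turn {user_turns})"


history: list[Message] = [
    {"role": "system", "content": "You are a concise tutor. Answer in 2 sentences."}
]
send_turn(history, "Explain RAG.", fake_complete)
send_turn(history, "Give one example.", fake_complete)
for m in history:
    print(f"{m['role']:>9}: {m['content']}")
```

To talk to a real model, pass a function that wraps the client.chat.completions.create call shown above and returns response.choices[0].message.content.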

Visual Learning Suggestion

A vertical message list visualization: a settings-icon next to "system", a person-icon next to "user", a sparkles-icon next to "assistant". Show how the array grows turn by turn.

Interactive Element

Open Google AI Studio or OpenAI Playground. Add a system message that says "Always reply in haiku." Send any prompt. Notice the constraint persists across turns.

Hands-on Lab

Write a 3-turn conversation by hand as a JSON array of seven messages: one system, three user, and three assistant entries. This is the "wire format" of every chat app.

Mini Exercise

What happens if you put a system instruction inside a user message instead? Is it weaker, stronger, or the same?

Common Mistakes

Forgetting to append assistant replies to history (model "forgets" each turn)
Putting persona inside every user message instead of once in system
Building "memory" via repeated system updates instead of clean message arrays

Debugging Tips

When the model loses persona, check that your system message is still in the array. When it loops, check you are not double-appending.

Knowledge Check Questions

What are the three core message roles?
What does the assistant role represent?
How does the model know what it said earlier?

Quiz Questions

To make a chatbot remember what it said two turns ago, you must:
a) Set a flag on the API
b) Re-include past assistant and user messages in the array
c) Use a different model
d) Use a vector database
Answer: b

Challenge Task

Build a small script that maintains a chat in a Python list and prints history nicely. No frameworks. Just messages.append(...).

Real-world Use Cases

Chatbots with personas (customer support, tutor, sales)
Multi-step agents that hand off to one another
Tool-using assistants that emit tool messages

Industry Insight

In production, you do not keep growing the messages array forever. You truncate it, summarize it, or keep a rolling window of recent turns. Module 6 will show you how.

Interview Questions

What is the difference between system, user, and assistant messages?
How would you maintain memory across 50 turns without blowing the context window?
Why is the system message often more "obeyed" than instructions inside user content?

Summary

A prompt is an ordered list of role-tagged messages, not a single string. System sets the stage, user asks, assistant remembers. You manage the array and resend it on every call.

Lesson 3.3: Context windows: what fits, what gets cut, and the lost-in-the-middle effect

Hook / Why This Matters

A 1M-token context window sounds infinite. It is not. And paying to send 800K tokens you do not need is a real, expensive 2026 mistake.

Beginner Analogy

Your short-term memory holds about seven things at once. The model's context window is much bigger, but still finite. And just like you, things in the middle of a long list get forgotten first.

Concept Explanation

The context window is the maximum number of tokens (input + output) the model can consider in one request. Examples in 2026:

GPT-4o: 128K
Claude 4.x family: 200K
Gemini 2.x: 1M (some variants)

When your input exceeds the window, you must truncate, summarize, or use RAG. When the input is huge but fits, expect:

Higher cost and latency.
The "lost in the middle" effect: facts placed in the middle of a very long input are recalled worse than facts at the start or end.

Technical Breakdown

The window includes system + all messages + the response budget you reserve via max_tokens. If you have a 128K window and set max_tokens=8000, your input ceiling is effectively 120K.
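The budget arithmetic above can be written as two small helpers. This is a sketch: the function names are illustrative, and token counts are assumed to come from whatever tokenizer matches your model.

```python
def input_ceiling(context_window: int, max_tokens: int) -> int:
    """Tokens left for system + history + user input after reserving output."""
    if max_tokens >= context_window:
        raise ValueError("max_tokens must be smaller than the context window")
    return context_window - max_tokens


def room_for_user_input(context_window: int, max_tokens: int, used_tokens: int) -> int:
    """Tokens still available for the next user message (never negative)."""
    return max(0, input_ceiling(context_window, max_tokens) - used_tokens)


print(input_ceiling(128_000, 8_000))                # 120000
print(room_for_user_input(128_000, 8_000, 30_000))  # 90000
```

Here used_tokens stands for the system prompt plus the history already in the array.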

Some providers offer "prompt caching" that lowers cost for repeated long prefixes. This is essential when you reuse the same big system prompt for many users.

Visual Learning Suggestion

A long horizontal bar representing the context window, divided into "system", "history", "current user", "response budget". Color the "response budget" differently. Overlay a U-shape "recall accuracy" curve to illustrate lost-in-the-middle.

Interactive Element

Paste a long document (5-10K words) into ChatGPT or Claude. Put a unique factoid at the start, middle, and end. Ask three questions, one targeting each. Note which the model nails and which it misses.

Hands-on Lab

Build a tiny script that takes a file, counts tokens, and warns if it exceeds 80% of the chosen model's window.

Mini Exercise

You have a 200K window. Your system + history is 80K tokens. You want to reserve 8K for the response. What is the max user input token count?

Common Mistakes

Thinking "max_tokens" is the input limit. It is the output reserve.
Stuffing every PDF into one prompt instead of using RAG
Ignoring the lost-in-the-middle effect when designing long prompts

Debugging Tips

If recall is poor on a long input, restructure: put the most important content near the top and bottom, with summaries in the middle.
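One way to act on this tip is to reorder chunks so the most important ones land at the edges of the prompt. The sketch below assumes you already have chunks sorted from most to least important. The name place_at_edges is illustrative, and whether the reordering helps for a given model is something to measure, not assume.

```python
def place_at_edges(chunks_by_importance: list[str]) -> list[str]:
    """Alternate chunks between the front and the back of the prompt.

    The input is sorted most-important-first. The output starts and ends
    with the most important chunks and pushes the least important ones
    toward the middle.
    """
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(chunks_by_importance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


print(place_at_edges(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```

In the example, A and B (the two most important chunks) end up first and last, while E (the least important) sits in the middle.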

Knowledge Check Questions

What does the context window include?
What is the lost-in-the-middle effect?
Why is sending huge inputs not a free upgrade just because the window is big?

Quiz Questions

To improve recall on a long document, you should:
a) Set temperature to 0
b) Place key facts near the start and end
c) Use a smaller model
d) Disable streaming
Answer: b

Challenge Task

Run a "needle in a haystack" test on any model. Generate a 50K-token text, hide a single unique sentence ("the secret code is 9j2k") at three depths, and measure recall.
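A starting point for the test harness might look like this. It only builds the prompts and checks answers. Sending each prompt to a model is left to you. The filler text, the helper names, and the depths are illustrative assumptions, and 1,000 short filler sentences are far fewer than 50K tokens, so scale the filler up for a real run.

```python
def build_haystack(filler: list[str], needle: str, depth: float) -> str:
    """Insert `needle` among `filler` sentences at a relative depth (0.0 to 1.0)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(len(filler) * depth)
    sentences = filler[:position] + [needle] + filler[position:]
    return " ".join(sentences)


def recalled(answer: str, secret: str) -> bool:
    """Crude recall check: does the model's answer contain the secret?"""
    return secret.lower() in answer.lower()


filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "The secret code is 9j2k."
# One prompt per depth: near the start, in the middle, near the end.
prompts = {depth: build_haystack(filler, needle, depth) for depth in (0.1, 0.5, 0.9)}
print({depth: len(text) for depth, text in prompts.items()})
```

Append a question such as "What is the secret code?" to each prompt, send it, and pass the model's answer to recalled(answer, "9j2k").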

Real-world Use Cases

Document QA tools
Long-form coding assistants
Codebase-aware refactoring tools

Industry Insight

2026 prompt design has moved from "stuff everything in" to "select the right 10K tokens via retrieval and put them where the model will see them best". That is RAG plus prompt engineering.

Interview Questions

What is a context window and how do you budget it?
Explain the lost-in-the-middle effect and how to mitigate it.
Why does 1M context not eliminate the need for RAG?

Summary

Context windows are big, finite, and lossy in the middle. Engineer prompts to fit, place critical content well, and prefer retrieval over stuffing.

Lesson 3.4: Multi-turn conversations: how memory really works

Hook / Why This Matters

A chatbot that "remembers" is doing one of three things: re-sending the whole history, summarizing it, or retrieving from a memory store. Knowing which avoids the most common production bugs.

Beginner Analogy

Imagine a meeting where every five minutes a new attendee joins. To keep up, you either replay the whole tape, hand them a summary, or look up what they need from a shared notes app. LLMs have the same three options.

Concept Explanation

Four memory strategies:

Full history (replay): include every prior message. Highest fidelity, highest cost, blows up beyond a few thousand turns.
Rolling window: keep the last N messages. Cheap, loses old facts.
Summarized memory: after every K turns, summarize older history into a single message. Cheap, decent recall.
Retrieval-based memory: store messages in a vector DB, retrieve the most relevant ones at each turn. Scales to very long histories, more complex.

Most production chatbots use a hybrid: rolling window for recent turns + summary for older + retrieval for important facts.
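To make the retrieval idea concrete, here is a toy sketch. It scores stored messages by word overlap with the current question, which is a deliberately simple stand-in for the embedding similarity a real vector DB would compute. The function names are illustrative.

```python
import re


def words(text: str) -> set[str]:
    """Lowercase a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(memory: list[str], query: str, k: int = 2) -> list[str]:
    """Return up to k stored messages sharing the most words with the query."""
    q = words(query)
    scored = [(len(q & words(m)), m) for m in memory]
    scored = [pair for pair in scored if pair[0] > 0]  # drop unrelated messages
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps insertion order on ties
    return [m for _, m in scored[:k]]


memory = [
    "My favorite number is 42.",
    "I am learning Python for data work.",
    "Please keep answers short.",
]
print(retrieve(memory, "What is my favorite number?", k=1))
```

The retrieved messages would then be injected into the messages array for the current turn, alongside the recent rolling window.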

Technical Breakdown

For a Streamlit demo, full history fits easily. For an app with thousands of turns, you need summarization. Here is the summarization pattern:

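# Assumes messages[0] is the system prompt and summarize() wraps one LLM call.
if len(messages) > 20:
    system, older, recent = messages[0], messages[1:-10], messages[-10:]
    summary = summarize(older)  # compress older turns into a short text
    note = {"role": "system", "content": f"Earlier summary: {summary}"}
    messages = [system, note] + recent  # keep the persona, the summary, and the last 10 messages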

Visual Learning Suggestion

Four small diagrams side by side, one per strategy, with arrows showing message flow and a token-cost label.

Interactive Element

Have a 10-turn conversation with ChatGPT where in turn 1 you say "my favorite number is 42". By turn 10 ask "what is my favorite number". Note recall. Then start a new chat (no memory). Note loss.

Hands-on Lab

Extend your Lesson 3.2 chat script to use a rolling window of 10 messages, and warn when older messages get dropped.

Mini Exercise

Why is "summarized memory" not perfect? Where does it fail?

Common Mistakes

Forgetting to persist memory across user sessions (memory dies on refresh)
Mixing memory of multiple users when scaling
Trusting the model to "remember" without sending the history
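The second mistake above, mixing memory across users, is usually avoided by keying every history on a user or session ID. The class below is a minimal in-memory sketch with illustrative names. A real app would back it with a database so memory also survives a refresh or restart.

```python
from collections import defaultdict


class ConversationStore:
    """In-memory histories keyed by user ID, so users never share messages."""

    def __init__(self, system_prompt: str) -> None:
        self._system = {"role": "system", "content": system_prompt}
        self._histories: dict[str, list[dict[str, str]]] = defaultdict(list)

    def append(self, user_id: str, role: str, content: str) -> None:
        """Record one message in the history of a single user."""
        self._histories[user_id].append({"role": role, "content": content})

    def messages_for(self, user_id: str) -> list[dict[str, str]]:
        """Build the array to send for one user: shared system prompt + their turns."""
        return [self._system] + list(self._histories[user_id])


store = ConversationStore("You are a support agent.")
store.append("alice", "user", "My order is late.")
store.append("bob", "user", "Reset my password.")
print(store.messages_for("alice"))
```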

Debugging Tips

If users complain "it forgot what I said", check your memory strategy. Almost always one of: missing assistant append, rolling window too small, no summarization.

Knowledge Check Questions

Name three memory strategies.
When does a rolling window break down?
Why is retrieval-based memory needed for long-lived agents?

Quiz Questions

For a customer support bot that runs for months per user, the best memory strategy is:
a) Full history
b) Rolling window
c) Retrieval-based
d) None, restart each session
Answer: c

Challenge Task

Implement summarized memory: after every 10 messages, call the model to compress older turns into a single 200-token summary.

Real-world Use Cases

Customer support chatbots
AI tutors that remember your goals across sessions
Coding copilots that recall your project conventions

Industry Insight

In 2026 the hottest memory pattern is "structured memory": extract facts (preferences, goals, identities) into a small JSON store and re-inject them as system text, instead of replaying raw turns.
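The re-injection half of this pattern can be sketched in a few lines. Extracting the facts would normally take its own LLM call and is not shown. The fact keys, the wording of the system text, and the function name are all illustrative assumptions.

```python
import json


def facts_to_system_message(facts: dict[str, str]) -> dict[str, str]:
    """Render a small fact store as one system message."""
    lines = [f"- {key}: {value}" for key, value in sorted(facts.items())]
    content = "Known facts about the user:\n" + "\n".join(lines)
    return {"role": "system", "content": content}


# The store is plain JSON, so it can live in a file or a database column.
facts = json.loads('{"goal": "pass the AWS exam", "preferred_language": "Hindi"}')
facts["favorite_number"] = "42"  # a fact learned during a later turn

message = facts_to_system_message(facts)
print(message["content"])
```

Because the facts are a few short lines, this message costs far fewer tokens than replaying the raw turns they came from.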

Interview Questions

How does a chatbot "remember"?
Compare rolling window vs summarization vs retrieval memory.
How do you isolate memory between users?

Summary

LLMs are stateless. Memory is an application concern. Pick a strategy (replay, window, summary, retrieval) and engineer it deliberately.

Lesson 3.5: Token math and cost estimation

Hook / Why This Matters

The single most important spreadsheet of your AI career is the cost-per-feature estimator. This lesson gives you the formula.

Beginner Analogy

Calculating LLM cost is like calculating a phone bill. Per-minute price times minutes used, plus optional extras. Easy. People skip it and get shocked anyway.

Concept Explanation

Cost per request = (input_tokens / 1,000,000 * input_price) + (output_tokens / 1,000,000 * output_price).

Cost per feature per month = cost per request * requests per user per month * users.

Examples (illustrative 2026 prices):

Model	Input ($/M tok)	Output ($/M tok)
GPT-4o mini	0.15	0.60
GPT-4o	2.50	10.00
Claude Haiku-class	0.25	1.25
Claude Sonnet-class	3.00	15.00
Gemini Flash	0.075	0.30

(Check live prices before committing. These change quarterly.)
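The two formulas translate directly into code. The sketch below uses the illustrative GPT-4o row from the table. The traffic numbers (100 requests per user per month, 200 users) are made up for the example.

```python
def cost_per_request(
    input_tokens: int,
    output_tokens: int,
    input_price: float,   # dollars per 1M input tokens
    output_price: float,  # dollars per 1M output tokens
) -> float:
    """Dollar cost of one request."""
    return (
        input_tokens / 1_000_000 * input_price
        + output_tokens / 1_000_000 * output_price
    )


def cost_per_month(request_cost: float, requests_per_user: int, users: int) -> float:
    """Dollar cost of one feature per month."""
    return request_cost * requests_per_user * users


# Illustrative GPT-4o prices from the table: $2.50 input, $10.00 output per 1M tokens.
per_request = cost_per_request(1_000, 500, 2.50, 10.00)
monthly = cost_per_month(per_request, requests_per_user=100, users=200)
print(f"per request: ${per_request:.4f}, per month: ${monthly:.2f}")
```

With these inputs a request costs about $0.0075 and the feature about $150 per month.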

Technical Breakdown

Build the estimator in a spreadsheet with columns: model, input price, output price, avg input tokens, avg output tokens, requests/user/month, users, cost/user/month, total/month. Add a "30% buffer" column for spikes. This is your monthly burn projection.

Visual Learning Suggestion

A stacked bar chart: per-feature monthly cost broken down by model. Useful for showing in a planning meeting.

Interactive Element

Estimate the cost of a chatbot used 10 times per day by 1,000 users, average 300 input + 500 output tokens per request, on GPT-4o-mini. Answer at end.

Hands-on Lab

Build the estimator spreadsheet from scratch. Save it. You will use it for every project in this course.

Mini Exercise

If switching from GPT-4o to GPT-4o-mini drops cost 90% but quality 10%, when is the switch worth it?

Common Mistakes

Forgetting to include the system prompt token count (it ships every request)
Forgetting that conversation history grows input tokens linearly per turn, so the cumulative cost of a long chat grows roughly quadratically
Ignoring failed requests (retries cost too)
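The first two mistakes are easy to see with a small simulation. It assumes full-history replay with fixed message sizes, which is a simplification, and the function name is illustrative.

```python
def conversation_input_tokens(system: int, user: int, assistant: int, turns: int) -> list[int]:
    """Input tokens billed on each turn when the full history is replayed."""
    per_turn = []
    history = 0  # tokens of earlier user + assistant messages
    for _ in range(turns):
        per_turn.append(system + history + user)  # system prompt ships every request
        history += user + assistant
    return per_turn


per_turn = conversation_input_tokens(system=100, user=50, assistant=150, turns=3)
print(per_turn)       # [150, 350, 550]
print(sum(per_turn))  # 1050
```

Each turn's input grows by a fixed step, so the running total over the conversation grows much faster than the number of turns.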

Debugging Tips

If cost dashboards spike, check three things: prompt size grew, output is longer than expected, traffic spiked. Almost always one of those.

Knowledge Check Questions

How do you compute cost per request?
What two costs grow with conversation length?
When is small-model "good enough" the right call?

Quiz Questions

To halve API spend with least quality loss, your first move is usually:
a) Use a more expensive model
b) Switch to a smaller, capable model for non-critical paths
c) Disable streaming
d) Increase max_tokens
Answer: b

Challenge Task

Pick a hypothetical product and produce a one-page cost memo with three scenarios (low, expected, high) and a recommendation.

Real-world Use Cases

Pricing your SaaS feature
Choosing between providers
Justifying a model switch to your team

Industry Insight

The 2026 job title "AI cost engineer" did not exist in 2023. It now does. Anyone fluent in token math can become the highest-impact engineer on the team within a quarter.

Interview Questions

Walk me through how you would estimate the cost of a chatbot feature.
What is the typical input/output cost ratio?
How do you control costs in production?

Summary

Cost in LLM apps is predictable if you count tokens. Build the spreadsheet, monitor it, and the bill rarely surprises you.

Interactive Element answer: per request = (300/1M * 0.15) + (500/1M * 0.60) = $0.000045 + $0.0003 = $0.000345. Per user/month = 10 * 30 * $0.000345 = ~$0.10. 1,000 users = ~$103.50/month, or roughly $100.

Module 3 Recap

You can count tokens, design message arrays, plan within a context window, manage multi-turn memory, and estimate cost. You are now equipped to talk like an AI engineer in any planning meeting.

Next Module

Module 4: Prompt Engineering Fundamentals