Context windows: what fits, what gets cut, and the lost-in-the-middle effect
A 1M-token context window sounds infinite. It is not. And paying to send 800K tokens you do not need is a real, expensive 2026 mistake.
Your short-term memory holds about seven things at once. The model's context window is much bigger, but still finite. And, as with human memory, items in the middle of a long list are the first to be forgotten.
The context window is the maximum number of tokens (input + output) the model can consider in one request. Examples in 2026:
- GPT-4o: 128K
- Claude 4.x family: 200K
- Gemini 2.x: 1M (some variants)
When your input exceeds the window, you must truncate it, summarize it, or use retrieval-augmented generation (RAG) to send only the relevant parts. When the input is huge but still fits, expect:
- Higher cost and latency.
- The "lost in the middle" effect: facts placed in the middle of a very long input are recalled worse than facts at the start or end.
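One common way to work around the lost-in-the-middle effect is to arrange the input so the most important passages sit at the start and end. The sketch below assumes you already have chunks sorted from most to least relevant (for example, from a retrieval step). The function name `reorder_for_long_context` is hypothetical and not part of any library.

```python
def reorder_for_long_context(chunks_by_relevance: list[str]) -> list[str]:
    """Put the most relevant chunks at the edges and the least relevant in the middle.

    Assumes the input is sorted most relevant first.
    """
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate: even ranks fill the front, odd ranks fill the back.
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-most relevant chunk ends up last.
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # "A" is most relevant, "E" least
print(reorder_for_long_context(ranked))  # ['A', 'C', 'E', 'D', 'B']
```

How much this helps depends on the model and the task, so treat it as something to measure rather than a guarantee.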
The window includes the system prompt, all messages, and the response budget you reserve via max_tokens. If you have a 128K window (128,000 tokens) and set max_tokens=8000, your input ceiling is effectively 120K.
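The arithmetic above can be sketched in a few lines, along with a simple truncation strategy that drops the oldest messages first. This is a minimal illustration: `estimate_tokens` uses a rough 4-characters-per-token guess (an assumption, not a real tokenizer), and all function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough guess of ~4 characters per token (assumption; use a real tokenizer in practice)."""
    return max(1, len(text) // 4)


def input_budget(context_window: int, max_tokens: int) -> int:
    """Tokens left for the system prompt and messages after reserving the response."""
    if max_tokens >= context_window:
        raise ValueError("max_tokens must be smaller than the context window")
    return context_window - max_tokens


def fit_messages(system: str, messages: list[str], budget: int) -> list[str]:
    """Keep the newest messages that fit in the budget alongside the system prompt."""
    remaining = budget - estimate_tokens(system)
    kept: list[str] = []
    for msg in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(msg)
        if cost > remaining:
            break  # this message and everything older is dropped
        kept.append(msg)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return kept


print(input_budget(128_000, 8_000))  # 120000
```

Dropping the oldest turns is only one policy. Summarizing them instead keeps more information at the cost of an extra model call.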
Some providers offer "prompt caching", which lowers the cost of repeated long prefixes. This is especially valuable when you reuse the same large system prompt for many users.
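Because caching of this kind works on repeated prefixes, it generally pays to keep the shared part of the prompt identical across requests and put anything that varies at the end. The sketch below makes no provider API calls. `build_prompt` and `prefix_fingerprint` are hypothetical helpers, and exact caching rules (minimum length, expiry, how a prefix is marked) differ by provider.

```python
import hashlib


def build_prompt(shared_prefix: str, user_input: str) -> str:
    """Stable content first, per-request content last, so the prefix stays reusable."""
    return shared_prefix + "\n\n" + user_input


def prefix_fingerprint(shared_prefix: str) -> str:
    """Short hash for checking that the shared prefix has not drifted between requests."""
    return hashlib.sha256(shared_prefix.encode("utf-8")).hexdigest()[:12]


SYSTEM = "You are a support assistant for ExampleCo. Follow the policy below."  # placeholder text

prompt_a = build_prompt(SYSTEM, "How do I reset my password?")
prompt_b = build_prompt(SYSTEM, "Where is my invoice?")

# Both requests share the same leading text, which is what a prefix cache can reuse.
print(prompt_a.startswith(SYSTEM) and prompt_b.startswith(SYSTEM))  # True

# Putting per-request data (such as a timestamp) into the prefix changes it every time.
print(prefix_fingerprint(SYSTEM) == prefix_fingerprint(SYSTEM + " Now: 12:00"))  # False
```

Logging the fingerprint alongside each request is a cheap way to notice when an edit to the system prompt has silently changed the shared prefix.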