Context windows: what fits, what gets cut, and the lost-in-the-middle effect

A 1M-token context window sounds infinite. It is not. And paying to send 800K tokens you do not need is a real, expensive 2026 mistake.

Your short-term memory holds about seven things at once. The model's context window is much bigger, but still finite. And as with human memory, items in the middle of a long list are the first to be forgotten.

The context window is the maximum number of tokens (input + output) the model can consider in one request. Examples in 2026:

  • GPT-4o: 128K
  • Claude 4.x family: 200K
  • Gemini 2.x: 1M (some variants)

When your input exceeds the window, you must truncate it, summarize it, or use retrieval-augmented generation (RAG) to send only the relevant parts. When the input is huge but still fits, expect:

  1. Higher cost and latency.
  2. The "lost in the middle" effect: facts placed in the middle of a very long input are recalled worse than facts at the start or end.
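
The simplest form of truncation is dropping the oldest conversation turns until the rest fits. Here is a minimal sketch of that idea. It assumes a rough estimate of 4 characters per token (a heuristic, not a real tokenizer), and the names estimate_tokens and trim_history are made up for illustration.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token (heuristic, not a real tokenizer)."""
    return max(1, len(text) // 4)


def trim_history(system: str, history: list[str], budget: int) -> list[str]:
    """Drop the oldest messages until system + history fits within `budget` tokens."""
    kept = list(history)  # copy so the caller's list is left untouched
    used = estimate_tokens(system) + sum(estimate_tokens(m) for m in kept)
    while kept and used > budget:
        used -= estimate_tokens(kept.pop(0))  # the oldest message goes first
    return kept
```

Dropping from the front keeps the most recent turns, which usually matter most. Summarizing the dropped turns instead of discarding them is the next step up.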

The window includes the system prompt, all messages, and the response budget you reserve via max_tokens. If you have a 128K window and set max_tokens=8000, your input ceiling is effectively 120K.
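
The budget arithmetic above fits in a few lines. This is a sketch under the lesson's own assumption that input and output share one window. The function name max_input_tokens and its parameters are illustrative, not part of any provider's API.

```python
def max_input_tokens(window: int, max_tokens: int, already_used: int = 0) -> int:
    """Tokens left for new input after reserving the response budget.

    `already_used` covers content that is already committed, such as the
    system prompt and earlier messages.
    """
    remaining = window - max_tokens - already_used
    if remaining < 0:
        raise ValueError("system + history + response reserve exceed the window")
    return remaining


print(max_input_tokens(window=128_000, max_tokens=8_000))  # 120000
```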

Some providers offer "prompt caching", which lowers the cost of repeated long prefixes. It pays off most when you reuse the same large system prompt across many requests or users.

Visualize it

Picture a long horizontal bar representing the context window, divided into "system", "history", "current user", and "response budget", with the "response budget" segment in a different color. Overlay a U-shaped "recall accuracy" curve, high at both ends and low in the center, to illustrate lost-in-the-middle.

Try it now

Paste a long document (5-10K words) into ChatGPT or Claude. Put a unique factoid at the start, middle, and end. Ask three questions, one targeting each. Note which the model nails and which it misses.

Hands-on lab

Build a tiny script that takes a file, counts tokens, and warns if it exceeds 80% of the chosen model's window.
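
If you want a starting point, here is one possible sketch. It makes two loud assumptions: token counts are estimated at roughly 4 characters per token rather than measured with a real tokenizer, and the keys in WINDOWS are illustrative labels (the sizes come from the list earlier in this lesson), not official model identifiers.

```python
from pathlib import Path

# Sizes from the list earlier in this lesson; keys are illustrative labels.
WINDOWS = {"gpt-4o": 128_000, "claude-4": 200_000, "gemini-2": 1_000_000}
WARN_RATIO = 0.80


def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token. Swap in a real tokenizer for accuracy."""
    return len(text) // 4


def check_text(text: str, model: str) -> tuple[int, bool]:
    """Return (estimated tokens, True if the estimate exceeds 80% of the window)."""
    tokens = estimate_tokens(text)
    return tokens, tokens > WARN_RATIO * WINDOWS[model]


def check_file(path: str, model: str) -> str:
    """Read a file and report its estimated size against the model's window."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    tokens, too_big = check_text(text, model)
    status = "WARNING: over 80% of the window" if too_big else "ok"
    return f"{path}: ~{tokens} of {WINDOWS[model]} tokens ({status})"
```

Calling print(check_file("notes.txt", "gpt-4o")) on a file of your own prints a one-line report. For the lab, replace estimate_tokens with the tokenizer that matches your model, since the 4-characters rule can be well off for code or non-English text.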

Try it now

You have a 200K window. Your system + history is 80K tokens. You want to reserve 8K for the response. What is the max user input token count?

Common mistakes

  • Thinking "max_tokens" is the input limit. It is the output reserve.
  • Stuffing every PDF into one prompt instead of using RAG.
  • Ignoring the lost-in-the-middle effect when designing long prompts.

Debugging tip

If recall is poor on a long input, restructure: put the most important content near the top and bottom, and condense the less critical material in the middle into summaries.
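
The placement half of that tip can be automated. The sketch below assumes you already have your chunks sorted most-important-first (how you rank them is up to you). The function name place_for_recall is made up for this example, and it only handles ordering, not summarizing.

```python
def place_for_recall(ranked: list[str]) -> list[str]:
    """Order chunks so the highest-ranked sit at the edges and the lowest in the middle.

    `ranked` must be sorted most-important-first.
    """
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(ranked):
        # Alternate: 1st goes to the front, 2nd to the back, 3rd to the front, ...
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


print(place_for_recall(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```

The top-ranked chunk "A" opens the prompt, the runner-up "B" closes it, and the weakest chunk "E" lands in the center, where recall is worst.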

Challenge

Run a "needle in a haystack" test on any model. Generate a 50K-token text, hide a single unique sentence ("the secret code is 9j2k") at three depths (for example the start, middle, and end, one per run), and measure recall at each depth.
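
A possible harness for this challenge is sketched below. The ask_model argument is a hypothetical callable that you wire to whichever API you use: it takes a prompt string and returns the model's answer as a string. No real provider call is shown, and you supply the filler sentences yourself.

```python
from typing import Callable

NEEDLE = "The secret code is 9j2k."
QUESTION = "What is the secret code?"


def build_haystack(filler: list[str], depth: float) -> str:
    """Insert NEEDLE at a relative depth (0.0 = start, 1.0 = end) among filler sentences."""
    position = round(depth * len(filler))
    return " ".join(filler[:position] + [NEEDLE] + filler[position:])


def run_test(
    ask_model: Callable[[str], str],
    filler: list[str],
    depths: tuple[float, ...] = (0.0, 0.5, 1.0),
) -> dict[float, bool]:
    """Return, for each depth, whether the model's answer contains the code."""
    results: dict[float, bool] = {}
    for depth in depths:
        prompt = f"{build_haystack(filler, depth)}\n\n{QUESTION}"
        results[depth] = "9j2k" in ask_model(prompt)
    return results
```

One run per depth gives you a single yes/no, which is noisy. Repeating each depth with different filler and averaging gives a more trustworthy recall figure.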

Where this shows up

  • Document QA tools
  • Long-form coding assistants
  • Codebase-aware refactoring tools

From the field

2026 prompt design has moved from "stuff everything in" to "select the right 10K tokens via retrieval and put them where the model will see them best". That is RAG plus prompt engineering.

Recap

Context windows are big, finite, and lossy in the middle. Engineer prompts to fit, place critical content well, and prefer retrieval over stuffing.


Quick recall

What does the context window include?

Quiz time

  1. To improve recall on a long document, you should
