Lesson 2.4: Next-token prediction: ChatGPT is really autocomplete | GeekHub Learn

The whole magical thing is a glorified autocomplete. Once that lands, everything from hallucinations to creativity makes sense.

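You type "I am going to the gym to..." and your phone suggests "work out", "lift weights", "exercise". That is next-token prediction. ChatGPT does the same thing at a much larger scale: a deep stack of transformer layers (GPT-3, for example, has 96) that attend over the whole context, and a vocabulary on the order of 100,000 to 200,000 tokens.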

At every step, the model outputs a probability for each token in its vocabulary. The token with the highest probability is the "most likely next token". The system picks one (often not strictly the top, see "temperature") and appends it to the text. Then it does it again. And again. It stops when it samples a stop token or reaches a length limit.

The seemingly intelligent paragraph is really 800 sequential autocompletes.
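
The loop described above can be sketched in a few lines. This is a toy illustration, not how ChatGPT is implemented: `TOY_MODEL` is a hypothetical hand-written lookup table that only looks at the last token, and its probabilities are invented for the example. A real model conditions on the whole context and scores every token in its vocabulary.

```python
import random

STOP = "<stop>"

# Hypothetical toy "model": last token -> next-token probabilities.
# The numbers are invented for illustration only.
TOY_MODEL = {
    "to": {"work": 0.6, "lift": 0.3, "exercise": 0.1},
    "work": {"out": 1.0},
    "out": {STOP: 1.0},
    "lift": {"weights": 1.0},
    "weights": {STOP: 1.0},
    "exercise": {STOP: 1.0},
}


def next_token_distribution(tokens):
    """Return next-token probabilities given the tokens so far."""
    return TOY_MODEL[tokens[-1]]


def generate(prompt_tokens, rng, max_tokens=20):
    """Sample one token at a time until a stop token or the length limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        dist = next_token_distribution(tokens)
        candidates = list(dist)
        weights = [dist[t] for t in candidates]
        choice = rng.choices(candidates, weights=weights, k=1)[0]
        if choice == STOP:
            break
        tokens.append(choice)  # the new token becomes part of the context
    return tokens


prompt = ["I", "am", "going", "to", "the", "gym", "to"]
print(" ".join(generate(prompt, random.Random())))
```

Running it several times gives different endings, because each step samples from a distribution instead of looking up a fixed answer.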

The model emits a vector of "logits" of length vocab_size. A softmax turns it into a probability distribution. Sampling strategies include:

Greedy (temperature 0): always pick the top token. Deterministic, sometimes repetitive.
Temperature sampling: divide the logits by T before the softmax, then sample from the result. T > 1 flattens the distribution (wilder output), T < 1 sharpens it (more conservative output).
Top-k: only sample from the top k tokens.
Top-p (nucleus): only sample from the smallest set of most likely tokens whose cumulative probability reaches p.
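
The four strategies above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the function names are made up for this example, the logits are invented, and production inference code does the same arithmetic on tensors over the full vocabulary.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature rescales the logits first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Index of the highest-scoring token (the temperature-0 limit)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def renormalize(probs, keep):
    """Zero out every index not in keep and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


def sample(probs, rng):
    """Draw one token index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # invented scores for a three-token vocabulary
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax(logits, t)])
```

The printed rows show the top token's share shrinking as T rises, which is the "wilder" behavior. Top-k and top-p are applied to the probabilities after the softmax and before `sample`.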

Visualize it

A bar chart of next-token probabilities after the prompt "The capital of France is". One huge bar for "Paris", small bars for "the", "Lyon", "Marseille". Adjust the chart for "The capital of France is famous for its..." and watch the distribution flatten.
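
If no charting tool is at hand, a text version of the chart gives the same picture. The probabilities below are invented for illustration and were not measured from any model; `AFTER_CAPITAL`, `AFTER_FAMOUS_FOR`, and the "(other)" bucket are hypothetical names for this sketch. Entropy, in bits, is one way to put a number on how flat a distribution is.

```python
import math

# Invented, illustrative next-token probabilities (not real model output).
AFTER_CAPITAL = {
    "Paris": 0.92, "the": 0.03, "Lyon": 0.02, "Marseille": 0.01, "(other)": 0.02,
}
AFTER_FAMOUS_FOR = {
    "cuisine": 0.22, "art": 0.18, "wine": 0.15, "fashion": 0.12,
    "history": 0.10, "(other)": 0.23,
}


def bar_chart(dist, width=40):
    """Render one text bar per token, scaled so probability 1.0 fills width."""
    lines = []
    for token, p in dist.items():
        lines.append(f"{token:>10} {'#' * round(p * width)} {p:.2f}")
    return "\n".join(lines)


def entropy(dist):
    """Shannon entropy in bits: higher means a flatter distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


for name, dist in [("is", AFTER_CAPITAL), ("is famous for its", AFTER_FAMOUS_FOR)]:
    print(f'"The capital of France {name}"')
    print(bar_chart(dist))
    print(f"entropy: {entropy(dist):.2f} bits\n")
```

The second distribution has no dominant bar and a higher entropy, which is the flattening the exercise asks you to watch for.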

Try it now

In the OpenAI Playground or Google AI Studio, run the same prompt twice with temperature 0 (the answers should be identical or nearly so) and twice with temperature 1 (the answers should differ). Feel the difference.

Hands-on lab

Write a one-line prompt about your hometown. Run it five times at temperature 1. Observe the variation. Now set temperature to 0 and run again. Note that the outputs are identical (modulo provider non-determinism).

Try it now

If you set temperature to 0 and still see different responses across runs, what is one possible reason?

Common mistakes

Believing temperature 0 guarantees identical outputs across providers. It usually does not.
Cranking temperature to "make the model creative" instead of using prompt design.
Forgetting that the model is choosing tokens, not facts.

Debugging tip

If you need exact, reproducible structured outputs, use temperature 0 plus structured output mode (JSON schema). If you need creative variation, temperature 0.7 to 1.0 is a reasonable starting range.

Challenge

Build a one-shot "creative slogan generator" prompt. Optimize for variety with sampling settings rather than longer prompts.

Where this shows up

Streaming chat: each token is sent to the UI as it is sampled
Cost control: stop tokens end generation early, so you pay for fewer output tokens on long outputs
Reliable JSON: temperature 0 plus JSON mode

From the field

The single most common 2026 production gotcha: a developer assumes the model "knows" something, ships it, then hallucinations crash the demo. The model never "knows". It samples. Build accordingly.

Recap

ChatGPT is autocomplete at scale. Knowing that every output is sampled from a probability distribution makes hallucinations, creativity, and reliability all click into place.

Quick recall

What does temperature control?

Quiz time

1. Hallucinations occur because