GeekHub Learn
Module 9: Building a PDF Chatbot (RAG Project)
Lesson 9.3, 3 of 8 in this module, 2 min read

Chunking strategy that actually works

Chunking is where RAG projects either succeed or quietly fail. This lesson covers the defaults to start from and the levers you can adjust.

Think of it as tearing a cookbook into recipe cards. Too small, and each card is missing context. Too big, and a single card mixes several recipes together. There is a sweet spot.

Defaults to start with:

  • chunk size: ~500 tokens (about 350 words)
  • chunk overlap: 50 to 80 tokens
  • prefer chunking by paragraph or sentence, not by character

Always preserve metadata per chunk: source filename, page number, chunk index, document title.
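
One way to carry that metadata is to wrap every chunk in a small record. The sketch below is illustrative only: the `Chunk` dataclass, its field names, the `attach_metadata` helper, and the `(page_number, [chunk_text, ...])` input shape are all assumptions made for the example, and the text is assumed to be split already.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Chunk:
    """One chunk of text plus the metadata needed to cite it later."""
    text: str
    source: str        # source filename
    page: int          # 1-based page number
    chunk_index: int   # position of the chunk within the whole document
    title: str         # document title


def attach_metadata(pages, source, title):
    """Wrap already-split chunk texts in Chunk records.

    `pages` is a list of (page_number, [chunk_text, ...]) pairs.
    This input shape is hypothetical, chosen only for the example.
    """
    records = []
    index = 0
    for page_number, texts in pages:
        for text in texts:
            records.append(Chunk(text, source, page_number, index, title))
            index += 1
    return records


pages = [(1, ["Intro paragraph.", "Scope paragraph."]), (2, ["Method paragraph."])]
records = attach_metadata(pages, source="handbook.pdf", title="Team Handbook")
print(asdict(records[2]))
```

Keeping the record flat makes it easy to pass the same fields to whichever vector store you use as per-chunk metadata.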

Simple chunker:

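import tiktoken

# o200k_base is the encoding used by GPT-4o-family models
enc = tiktoken.get_encoding("o200k_base")

def chunk_text(text, chunk_tokens=500, overlap=80):
    """Split text into token windows that share `overlap` tokens with their neighbor."""
    if not 0 <= overlap < chunk_tokens:
        # overlap >= chunk_tokens would make the step zero or negative (infinite loop)
        raise ValueError("overlap must be >= 0 and smaller than chunk_tokens")
    tokens = enc.encode(text)
    chunks = []
    i = 0
    while i < len(tokens):
        j = min(i + chunk_tokens, len(tokens))
        # Note: decoding a token slice can cut a multi-byte character at a boundary
        chunks.append(enc.decode(tokens[i:j]))
        if j == len(tokens):
            break  # every token is covered; another step would emit a redundant tail chunk
        i += chunk_tokens - overlap  # advance, keeping `overlap` tokens of shared context
    return chunks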

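Or use LangChain's RecursiveCharacterTextSplitter for paragraph-aware splitting. By default its chunk_size and chunk_overlap are measured in characters, not tokens, so 2000 and 300 correspond to roughly 500 and 75 tokens of English prose: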

from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000, chunk_overlap=300, separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_text(text)

Visualize it

A bar of tokens with red brackets showing overlapping chunk boundaries. Below it, a "right chunk size" curve showing quality vs size.

Try it now

Take a 5-page PDF. Chunk it with three (chunk size, overlap) settings, in tokens: (200, 0), (500, 50), (1500, 200). Read through the chunks each setting produces. Pick the one that "reads sensibly".
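
Before reading the chunks, it can help to know how many to expect from each setting. The sketch below counts the windows a sliding-window chunker would emit, assuming the chunker stops as soon as a window reaches the end of the document. The `count_chunks` helper and the 2500-token document length are assumptions for illustration, not measurements.

```python
import math


def count_chunks(n_tokens, chunk_tokens, overlap):
    """Number of windows a sliding-window chunker emits for n_tokens tokens."""
    if not 0 <= overlap < chunk_tokens:
        raise ValueError("overlap must be >= 0 and smaller than chunk_tokens")
    if n_tokens <= 0:
        return 0
    step = chunk_tokens - overlap  # how far each window advances
    # The first window covers chunk_tokens tokens; each further window adds `step` new ones.
    return 1 + max(0, math.ceil((n_tokens - chunk_tokens) / step))


N_TOKENS = 2500  # assumed length of the test PDF, not a measured value
for size, overlap in [(200, 0), (500, 50), (1500, 200)]:
    n = count_chunks(N_TOKENS, size, overlap)
    print(f"size={size:>4} overlap={overlap:>3} -> {n} chunks")
```

Under that assumed length, the smallest setting yields many short chunks and the largest only a couple, which is a quick way to see why the extremes tend to read poorly.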

Hands-on lab

Implement chunk_text in rag.py. Process your test PDFs. Print 3 random chunks. Sanity check.
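
A minimal sketch of the "print 3 random chunks" step. The `sample_chunks` and `preview` helpers are hypothetical names, and the toy `chunks` list stands in for the output of your own chunk_text; a fixed seed keeps the sample reproducible between runs.

```python
import random


def sample_chunks(chunks, k=3, seed=0):
    """Return up to k (index, chunk) pairs chosen at random, reproducibly."""
    rng = random.Random(seed)
    indices = sorted(rng.sample(range(len(chunks)), min(k, len(chunks))))
    return [(i, chunks[i]) for i in indices]


def preview(chunk, width=60):
    """Collapse whitespace and truncate so a chunk fits on one line."""
    flat = " ".join(chunk.split())
    return flat if len(flat) <= width else flat[:width] + "..."


# Toy data standing in for real chunks from a PDF.
chunks = [f"Chunk number {n} talks about topic {n}." for n in range(10)]
picked = sample_chunks(chunks)
for i, chunk in picked:
    print(f"[{i}] {len(chunk)} chars: {preview(chunk)}")
```

When reading the samples, check that each one starts and ends at a sensible place and is neither empty nor a fragment of a single sentence.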

Try it now

Why does overlap prevent answers being cut in half across chunks?
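
A small demonstration of the effect, using whitespace-separated words as a stand-in for tokens. The `chunk_words` helper and the sample text are assumptions made for this sketch. With zero overlap the answer sentence straddles a chunk boundary; with overlap, one chunk contains it whole.

```python
def chunk_words(words, size, overlap):
    """Sliding window over a list of words (a stand-in for real tokens)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the window reached the end of the text
    return chunks


text = (
    "The warranty covers parts and labor. "
    "Refunds are issued within 30 days of purchase. "
    "Shipping costs are not refundable."
)
answer = "Refunds are issued within 30 days of purchase."
words = text.split()

no_overlap = chunk_words(words, size=10, overlap=0)
with_overlap = chunk_words(words, size=10, overlap=5)

print("zero overlap keeps the answer whole:", any(answer in c for c in no_overlap))
print("overlap of 5 keeps the answer whole:", any(answer in c for c in with_overlap))
```

In general, a sliding window guarantees that any span no longer than the overlap lands entirely inside at least one chunk; longer spans may still be split.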

Common mistakes

  • Chunking by raw characters (breaks mid-word)
  • Zero overlap (loses cross-chunk answers)
  • Different chunk size per document (inconsistent retrieval)

Debugging tip

If retrieved chunks frequently miss the answer, your chunks are probably too small or your overlap is too low.

Challenge

Add "semantic chunking": split on heading boundaries detected by regex. Compare retrieval quality.
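
A possible starting point for the challenge is plain heading-based splitting. The sketch below assumes headings look like Markdown lines beginning with one to six `#` characters; text extracted from a PDF often will not, so treat the `HEADING` pattern as a placeholder to replace with one that matches your documents (for example numbered headings). The `split_on_headings` name is also just illustrative.

```python
import re

# Assumed heading pattern: a line starting with 1-6 '#' characters (Markdown style).
HEADING = re.compile(r"^#{1,6}[ \t]+.+$", re.MULTILINE)


def split_on_headings(text):
    """Split text into (heading, body) sections at each matched heading line."""
    matches = list(HEADING.finditer(text))
    sections = []
    # Text before the first heading becomes a section with no heading.
    first = matches[0].start() if matches else len(text)
    preamble = text[:first].strip()
    if preamble:
        sections.append((None, preamble))
    for match, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        heading = match.group().lstrip("#").strip()
        body = text[match.end():end].strip()
        sections.append((heading, body))
    return sections


doc = "Preface text.\n\n# Setup\nInstall the tool.\n\n## Options\nUse --fast.\n"
for heading, body in split_on_headings(doc):
    print(repr(heading), "->", repr(body))
```

Sections longer than your chunk size would still need to be split further, for instance with the token window from chunk_text, and the heading can be stored as extra metadata on each resulting chunk.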

Where this shows up

  • All RAG ingests
  • Legal contract clause indexing
  • Code repo chunking by function

From the field

The emerging best practice in 2026 is "agentic chunking", where an LLM proposes chunk boundaries based on meaning. Ingest is slower and pricier, but retrieval can be dramatically better.

Recap

500 tokens, 80 overlap, paragraph-aware, with metadata. Adjust later based on eval results.


Quick recall

What is chunk overlap?

Quiz time

  1. A good starting chunk size for prose PDFs is around
