Chunking strategy that actually works
Chunking is where RAG projects either succeed or quietly fail. This lesson covers the default settings to start with and the levers you can adjust from there.
You are tearing a book into recipe cards. Too small and each card is missing context. Too big and each card mixes several recipes together. There is a sweet spot.
Defaults to start with:
- chunk size: ~500 tokens (about 350 words)
- chunk overlap: 50 to 80 tokens
- prefer chunking by paragraph or sentence, not by character
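To make the size and overlap defaults concrete, here is a small sketch of the arithmetic only, with no tokenizer involved. The function name chunk_windows is made up for illustration; it just lists which token offsets each chunk would cover for a document of a given length.

```python
def chunk_windows(n_tokens, chunk_tokens=500, overlap=80):
    """Return (start, end) token offsets for fixed-size chunks with overlap."""
    if not 0 <= overlap < chunk_tokens:
        raise ValueError("overlap must be >= 0 and smaller than chunk_tokens")
    stride = chunk_tokens - overlap  # how far each new chunk advances
    windows = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_tokens, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            break  # this window already reaches the end of the document
        start += stride
    return windows


# A 1200-token document with the defaults above
print(chunk_windows(1200))
# prints [(0, 500), (420, 920), (840, 1200)]
```

Each chunk starts 420 tokens after the previous one, so neighbouring chunks share 80 tokens. A sentence that straddles a boundary therefore appears whole in at least one chunk as long as it is shorter than the overlap.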
Always preserve metadata per chunk: source filename, page number, chunk index, document title.
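One possible way to carry that metadata is a small record per chunk. This is a sketch, not a required schema: the ChunkRecord class, the attach_metadata helper, and the sample filename and title are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, asdict


@dataclass
class ChunkRecord:
    """One chunk plus the metadata needed to cite it later (hypothetical schema)."""
    text: str
    source: str       # source filename
    title: str        # document title
    page: int         # page number the chunk came from
    chunk_index: int  # position of the chunk within that page's chunk list


def attach_metadata(chunks, source, title, page):
    """Wrap plain chunk strings in ChunkRecord objects."""
    return [
        ChunkRecord(text=c, source=source, title=title, page=page, chunk_index=i)
        for i, c in enumerate(chunks)
    ]


# Example values are invented for the demo
records = attach_metadata(
    ["first chunk", "second chunk"],
    source="handbook.pdf",
    title="Employee Handbook",
    page=12,
)
print(asdict(records[1]))
```

Storing the record as a plain dict (via asdict) keeps it easy to pass alongside the embedding to whatever vector store you use, so answers can point back to a file and page.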
Simple chunker:
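import tiktoken

# o200k_base is the encoding used by GPT-4o-family models
enc = tiktoken.get_encoding("o200k_base")

def chunk_text(text, chunk_tokens=500, overlap=80):
    """Split text into chunks of chunk_tokens tokens that share overlap tokens."""
    if not 0 <= overlap < chunk_tokens:
        raise ValueError("overlap must be smaller than chunk_tokens")
    tokens = enc.encode(text)
    chunks = []
    i = 0
    while i < len(tokens):
        j = min(i + chunk_tokens, len(tokens))
        chunks.append(enc.decode(tokens[i:j]))
        if j == len(tokens):
            break  # stop here so the tail is not emitted again as a tiny chunk
        i += chunk_tokens - overlap  # advance so neighbours share `overlap` tokens
    return chunks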
Or use LangChain's RecursiveCharacterTextSplitter for paragraph-aware splitting:
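from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are measured in characters by default,
# so 2000 and 300 correspond to roughly 500 and 75 tokens of English prose.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000, chunk_overlap=300, separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_text(text)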
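To see what paragraph-aware splitting means without any library, here is a rough standard-library sketch. It is not how LangChain implements its splitter: pack_paragraphs is a hypothetical helper that greedily packs whole paragraphs into chunks, uses a word count as a stand-in for tokens (350 words, matching the rough default above), and adds no overlap.

```python
import re


def pack_paragraphs(text, max_words=350):
    """Greedily pack whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept whole rather than split.
    """
    # Paragraphs are separated by one or more blank lines
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_words = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Flush the current chunk if adding this paragraph would exceed the budget
        if current and current_words + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(para)
        current_words += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "one two three\n\nfour five\n\nsix seven eight nine"
print(pack_paragraphs(sample, max_words=5))
# prints ['one two three\n\nfour five', 'six seven eight nine']
```

Because boundaries always fall between paragraphs, no chunk starts or ends mid-sentence. The trade-off is uneven chunk sizes, which is one reason a recursive splitter falls back to smaller separators when a paragraph is too long.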