Parsing PDFs into clean text
Many RAG project failures start at the PDF parser: bad text in, bad retrieval out. In this lesson we pick a reliable parser and learn to clean its output.
Think of a blurry book scan versus a crisp ebook. You can read both, but only one is searchable. Your PDFs need to become the ebook.
We will use pypdf (lightweight) or pymupdf (better quality). For tables and scanned PDFs, consider unstructured or marker.
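One rough way to decide whether a PDF needs the heavier tools is to check how much text a plain extractor returns for each page. The sketch below assumes you already have a list of per-page text strings from whichever parser you chose; the function name `find_probably_scanned_pages` and the `min_chars` threshold are hypothetical and chosen only for illustration.

```python
def find_probably_scanned_pages(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is suspiciously short.

    min_chars is an arbitrary illustrative threshold, not a recommended value.
    """
    flagged = []
    for page_num, text in enumerate(page_texts, start=1):
        # Count only non-whitespace characters; treat None as empty text.
        visible = len("".join((text or "").split()))
        if visible < min_chars:
            flagged.append(page_num)
    return flagged


page_texts = ["Chapter 1. Retrieval starts with clean text.", "", "   \n  "]
print(find_probably_scanned_pages(page_texts))
```

Pages flagged this way are candidates for an OCR-capable pipeline. A page that is legitimately almost blank will also be flagged, so treat the result as a hint rather than a verdict.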
After extraction, we clean: collapse whitespace, drop headers/footers, strip page numbers.
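The header/footer and page-number steps can be sketched with the standard library alone. This is a minimal illustration, not a definitive implementation: `clean_pages`, `min_repeat_ratio`, and the page-number pattern are hypothetical, and the sketch assumes per-page text with line breaks still intact (that is, before whitespace is collapsed).

```python
import re
from collections import Counter

# Matches lines that are only a page marker, e.g. "7", "Page 7", "7 / 12", "Page 7 of 12".
# Assumption: such lines are never real content (a lone "2024" would also match).
PAGE_NUMBER_RE = re.compile(r"(?:page\s+)?\d+(?:\s*(?:/|of)\s*\d+)?", re.IGNORECASE)


def clean_pages(raw_pages, min_repeat_ratio=0.6):
    """Drop page-number lines and repeated first/last lines, then collapse whitespace."""
    # Split each page into non-empty lines and drop bare page-number lines first,
    # so a footer sitting just above the page number becomes the last line.
    pages_lines = []
    for raw in raw_pages:
        lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
        pages_lines.append([ln for ln in lines if not PAGE_NUMBER_RE.fullmatch(ln)])

    # Count on how many pages each first or last line occurs.
    edge_counts = Counter()
    for lines in pages_lines:
        if lines:
            edge_counts.update({lines[0], lines[-1]})

    # A line counts as a header/footer if it is an edge line on enough pages.
    threshold = max(2, int(len(raw_pages) * min_repeat_ratio))
    repeated = {line for line, count in edge_counts.items() if count >= threshold}

    cleaned = []
    for lines in pages_lines:
        # Note: a repeated line is removed wherever it occurs on the page.
        kept = [ln for ln in lines if ln not in repeated]
        cleaned.append(" ".join(" ".join(kept).split()))
    return cleaned


raw_pages = [
    "ACME Report\nRetrieval needs clean text.\nMore body.\n1",
    "ACME Report\nChunking comes next.\nPage 2 of 3",
    "ACME Report\nCitations need page numbers.\n3",
]
print(clean_pages(raw_pages))
```

Detecting repeats only works when line breaks survive extraction, so this kind of cleanup has to run before whitespace is collapsed into single spaces.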
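from pypdf import PdfReader


def pdf_to_pages(file_path):
    """Extract text page by page and return a list of page dicts."""
    reader = PdfReader(file_path)
    pages = []
    # Number pages from 1 so they match what a reader sees in a PDF viewer.
    for i, page in enumerate(reader.pages, start=1):
        # Fall back to an empty string if extraction yields nothing.
        text = page.extract_text() or ""
        # Collapse all whitespace runs (including line breaks) into single spaces.
        text = " ".join(text.split())
        pages.append({"page_num": i, "text": text})
    return pages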
Each page becomes a dict holding its text and its page number. The page number is critical for citations later.
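To show why the page number matters, here is a minimal sketch of carrying it from pages into chunks and then into a citation string. It assumes page dicts shaped like `{"page_num": ..., "text": ...}`; the names `chunk_pages`, `format_citation`, `max_chars`, and `source_name` are hypothetical, and the simple word-based splitting is an illustration rather than a recommended chunking strategy.

```python
def chunk_pages(pages, max_chars=500):
    """Split each page into chunks of roughly max_chars, keeping page_num.

    Chunks never span two pages. A single word longer than max_chars
    is kept whole, so a chunk can exceed the limit in that case.
    """
    chunks = []
    for page in pages:
        current, current_len = [], 0
        for word in page["text"].split():
            # +1 accounts for the joining space when the chunk is non-empty.
            extra = len(word) + (1 if current else 0)
            if current and current_len + extra > max_chars:
                chunks.append({"page_num": page["page_num"], "text": " ".join(current)})
                current, current_len = [], 0
                extra = len(word)
            current.append(word)
            current_len += extra
        if current:
            chunks.append({"page_num": page["page_num"], "text": " ".join(current)})
    return chunks


def format_citation(chunk, source_name):
    """Build a human-readable citation from a chunk's stored page number."""
    return f"{source_name}, p. {chunk['page_num']}"


pages = [
    {"page_num": 1, "text": "aaa bbb ccc"},
    {"page_num": 2, "text": "ddd"},
]
chunks = chunk_pages(pages, max_chars=7)
print(chunks)
print(format_citation(chunks[2], "report.pdf"))
```

Keeping chunks within a single page is a deliberate simplification: every chunk maps to exactly one page number, at the cost of splitting sentences that run across a page break.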
Quick recall
Why store page numbers per chunk?
Quiz time
1. A scanned PDF (image-only) needs