Lesson 9.2: Parsing PDFs into clean text | GeekHub Learn

90% of RAG project failures start at the PDF parser: bad text in, bad retrieval out. In this lesson we pick a reliable parser and learn to clean its output.

Think of a blurry book scan versus a crisp ebook: you can read both, but only the ebook is searchable. Parsing is how your PDFs become the ebook.

We will use pypdf (lightweight) or pymupdf (better quality). For tables and scanned PDFs, consider unstructured or marker.

After extraction, we clean: collapse whitespace, drop headers/footers, strip page numbers.

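from pypdf import PdfReader

def pdf_to_pages(file_path):
    """Extract text page by page, keeping 1-based page numbers for citations."""
    reader = PdfReader(file_path)
    pages = []
    for i, page in enumerate(reader.pages, start=1):
        # Image-only pages may yield no text; fall back to an empty string
        text = page.extract_text() or ""
        # Collapse runs of spaces, tabs, and newlines into single spaces
        text = " ".join(text.split())
        pages.append({"page_num": i, "text": text})
    return pages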

Each page becomes a dict holding its text and its page number. The page number is critical for citations later.
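pdf_to_pages only covers the whitespace step. The other two cleaning steps (dropping headers/footers and stripping page numbers) could be sketched as below. This is a rough heuristic, not a definitive implementation: it assumes you pass in the raw page strings with newlines still intact (that is, before whitespace is collapsed), it treats any line that repeats on most pages as a header or footer, and the names clean_pages, repeat_ratio, and PAGE_NUMBER are made up for this example.

```python
import re
from collections import Counter

# Lines such as "7", "Page 7", or "page 7 of 50" (an assumed set of formats)
PAGE_NUMBER = re.compile(r"^(page\s+)?\d+(\s+of\s+\d+)?$", re.IGNORECASE)

def clean_pages(raw_pages, repeat_ratio=0.6):
    """raw_pages: list of raw page strings (newlines intact), in page order."""
    # Split each page into stripped, non-empty lines
    page_lines = [
        [ln.strip() for ln in raw.splitlines() if ln.strip()]
        for raw in raw_pages
    ]

    # Count on how many pages each distinct line appears
    counts = Counter()
    for lines in page_lines:
        counts.update(set(lines))

    # A line seen on at least this many pages is treated as a header/footer
    threshold = max(2, int(len(raw_pages) * repeat_ratio))
    repeated = {ln for ln, n in counts.items() if n >= threshold}

    cleaned = []
    for i, lines in enumerate(page_lines, start=1):
        kept = [
            ln for ln in lines
            if ln not in repeated and not PAGE_NUMBER.match(ln)
        ]
        # Collapse whitespace only after the line-based filters have run
        text = " ".join(" ".join(kept).split())
        cleaned.append({"page_num": i, "text": text})
    return cleaned
```

One trade-off to be aware of: a body line that contains only a number (a lone table cell, for example) is also dropped, so check the output on a few real pages before trusting it.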

Visualize it

A 3-step "PDF -> raw text -> cleaned text" diagram with messy text on the left and tidy on the right.

Try it now

Run the parser on any PDF you have. Print the first 200 chars of pages 1, 5, and 10. Spot any garbage.
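A small helper makes this spot check repeatable. The sketch below assumes the list-of-dicts shape returned by pdf_to_pages; preview_pages and its parameters are hypothetical names for this example. Pages that do not exist (page 10 of a 5-page PDF, say) are skipped.

```python
def preview_pages(pages, page_nums=(1, 5, 10), width=200):
    """Return {page_num: first `width` characters} for the requested pages."""
    by_num = {p["page_num"]: p["text"] for p in pages}
    return {n: by_num[n][:width] for n in page_nums if n in by_num}

def print_previews(pages, page_nums=(1, 5, 10), width=200):
    for n, snippet in preview_pages(pages, page_nums, width).items():
        print(f"--- page {n} ---")
        # repr() makes stray control characters and odd glyphs visible
        print(repr(snippet))
```

Call it as print_previews(pdf_to_pages("your_file.pdf")), where the path is whatever PDF you have on hand.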

Hands-on lab

Implement pdf_to_pages in rag.py. Test it on a 5-page PDF and a 50-page PDF.

Try it now

What types of PDFs will this parser struggle with?

Common mistakes

Concatenating all pages (loses page-level citations)
Skipping cleaning (junk tokens hurt retrieval)
Choosing a heavy parser for simple PDFs (slow)
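To make the first mistake concrete, here is one way to split text into chunks without losing the page number. It is a minimal sketch, assuming the page dicts from pdf_to_pages and a simple character budget; chunk_pages and max_chars are hypothetical names, and chunks deliberately never cross a page boundary.

```python
def chunk_pages(pages, max_chars=500):
    """Split each page into word-aligned chunks that remember their page."""
    chunks = []
    for page in pages:
        current = ""
        for word in page["text"].split():
            candidate = f"{current} {word}".strip()
            if len(candidate) > max_chars and current:
                # Budget exceeded: close the current chunk, start a new one
                chunks.append({"page_num": page["page_num"], "text": current})
                current = word
            else:
                current = candidate
        if current:
            chunks.append({"page_num": page["page_num"], "text": current})
    return chunks
```

Because every chunk carries page_num, a retrieved chunk can be cited by page. Concatenating all pages first would throw that information away.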

Debugging tip

If pages come back blank, your PDF may be scanned images. You need OCR (Tesseract or a vision LLM call).
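A quick check like the following can flag the problem before you debug anything else. It is only a sketch: the threshold values are guesses to tune on your own files, and find_blank_pages and looks_scanned are hypothetical helper names that work on the output of pdf_to_pages.

```python
def find_blank_pages(pages, min_chars=20):
    """Page numbers whose extracted text is shorter than min_chars."""
    return [p["page_num"] for p in pages if len(p["text"].strip()) < min_chars]

def looks_scanned(pages, min_chars=20, blank_ratio=0.8):
    """True when most pages are (nearly) empty, which suggests image-only pages."""
    if not pages:
        return False
    blank = find_blank_pages(pages, min_chars)
    return len(blank) / len(pages) >= blank_ratio
```

A few blank pages in an otherwise healthy document are usually covers or full-page figures. A document where almost every page is blank is the one that needs OCR.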

Challenge

Add a fallback that uses a vision LLM (GPT-4o or Gemini) to OCR pages where pypdf returns empty text.
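If you want a starting point, the wiring might look like the sketch below. It leaves the actual OCR step to you: ocr_page is a hypothetical callable that you supply, taking a page number and returning text, and inside it you would render the page to an image and call Tesseract or your vision model. No specific vendor API is assumed here, and fill_empty_pages and the source field are made-up names.

```python
def fill_empty_pages(pages, ocr_page, min_chars=1):
    """Re-extract pages with too little text using a caller-supplied OCR function."""
    filled = []
    for page in pages:
        text = page["text"]
        source = "pypdf"
        if len(text.strip()) < min_chars:
            # ocr_page is hypothetical: it takes a page number, returns a string
            text = " ".join(ocr_page(page["page_num"]).split())
            source = "ocr"
        filled.append(
            {"page_num": page["page_num"], "text": text, "source": source}
        )
    return filled
```

Recording the source per page is a design choice rather than a requirement. It lets you audit later how many pages took the slow path and compare quality between the two routes.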

Where this shows up

Legal document QA
Textbook chatbots
Receipts and invoice extraction
Research paper assistants

From the field

The 2026 production trend: most teams now mix pypdf for fast paths and vision LLMs for hard PDFs. It is cheaper than people think and the quality gap is closing.

Recap

Reliable extraction is the secret to good RAG. Pick pypdf, clean, preserve page numbers, escalate to OCR when needed.

Quick recall

Why store page numbers per chunk?

Quiz time

1. A scanned PDF (image-only) needs