Module 9: Building a PDF Chatbot (RAG Project)

Module Goal

Ship the flagship project of this course: a working "chat with any PDF" app that uses real RAG, retrieves citations, and lives at a public URL. By the end you can rebuild this in any future job interview from memory.

Estimated Duration

5 to 7 hours.

Skills Learned

Parsing and chunking PDFs
Building a persistent vector index
End-to-end RAG with citations
Streamlit UI for upload + chat
Evaluating and improving RAG quality
Free deployment with secrets

Real-world Importance

PDF chatbots are among the most requested AI features in 2026: legal teams, students, founders, and support orgs all want them. Knowing how to build one professionally is a hireable skill on its own.

Lessons in this module

Project setup and architecture review
Parsing PDFs into clean text
Chunking strategy that actually works
Building the vector index
The retrieve-and-answer flow with citations
Streamlit UI: upload, chat, citations
Evaluation: knowing when RAG is "good enough"
Deployment and stretch goals

Lesson 9.1: Project setup and architecture review

Hook / Why This Matters

Five minutes of planning saves five hours of refactoring. We sketch first, code second.

Beginner Analogy

Before building a kitchen, you draw it. Before building a RAG app, you sketch the pipeline.

Concept Explanation

The architecture (drawn in Module 7.5):

PDFs -> text -> chunks -> embeddings -> ChromaDB
                                         |
user question -> embed -> top-K -> prompt with chunks -> LLM -> answer + citations

Decisions to lock in upfront:

Embedding model: text-embedding-3-small (cheap, capable)
LLM: gpt-4o-mini (cheap, capable)
Vector DB: ChromaDB (local, persistent)
UI: Streamlit
Citations: include source filename and page number

Technical Breakdown

Project structure:

pdf-chatbot/
  app.py            # Streamlit UI
  rag.py            # ingest + retrieve helpers
  prompts.py        # system prompt + answer template
  data/             # uploaded PDFs (gitignored)
  chroma_db/        # vector store (gitignored)
  requirements.txt
  .env              # API keys (gitignored)
  .gitignore
  README.md

Visual Learning Suggestion

A file-tree diagram next to a runtime architecture diagram. Side by side. Wiring becomes obvious.

Interactive Element

Create the folder structure. Run git init, write .gitignore, and push an empty repo. 10 minutes.

Hands-on Lab

Set up the project skeleton above. Create empty stub functions in rag.py. Commit.
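
One possible shape for the rag.py stubs, using the function names that the later lessons in this module fill in. The signatures are a sketch and may shift as you implement each lesson.

```python
# rag.py (skeleton): each stub is replaced by a real implementation in a later lesson.

def pdf_to_pages(file_path):
    """Lesson 9.2: return a list of {"page_num", "text"} dicts."""
    raise NotImplementedError

def chunk_text(text, chunk_tokens=500, overlap=80):
    """Lesson 9.3: split text into overlapping chunks."""
    raise NotImplementedError

def index_pdf(file_path):
    """Lesson 9.4: embed chunks and store them in Chroma."""
    raise NotImplementedError

def retrieve(question, k=5):
    """Lesson 9.5: return the top-k (chunk, metadata) pairs."""
    raise NotImplementedError

def answer(question):
    """Lesson 9.5: produce an answer with citations."""
    raise NotImplementedError
```

Raising NotImplementedError makes a forgotten stub fail loudly instead of silently returning None.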

Mini Exercise

Why is splitting app.py from rag.py worth it even for a small project?

Common Mistakes

One giant app.py with everything mixed (impossible to test)
Skipping .gitignore (you will leak chroma_db or .env)
No README.md (cannot show recruiters)

Debugging Tips

If you cannot draw and explain your structure in 60 seconds, restructure now. It is faster than later.

Knowledge Check Questions

Why split UI from RAG logic?
What files should be gitignored?
Why pick the embedding model upfront?

Quiz Questions

The vector DB folder should be: a) Committed to GitHub b) Gitignored c) Renamed d) In /tmp Answer: b

Challenge Task

Write the README's "Architecture" section in 200 words with a Markdown diagram.

Real-world Use Cases

All production AI apps
Demo projects for jobs
Open-source RAG starters

Industry Insight

A clean file structure on a GitHub repo is the first signal recruiters scan. Most beginners ship a mess. You will not.

Interview Questions

Walk me through your project structure.
Why this split of modules?
How would you scale this to multi-user?

Summary

Sketch, scaffold, gitignore, README. Then code.

Lesson 9.2: Parsing PDFs into clean text

Hook / Why This Matters

Many RAG project failures start at the PDF parser. Bad text in = bad retrieval out. We pick a reliable parser and learn to clean.

Beginner Analogy

A blurry book scan vs a crisp ebook. You can read both, but only one is searchable. PDFs need to become the ebook.

Concept Explanation

We will use pypdf (lightweight) or pymupdf (better quality). For tables and scanned PDFs, consider unstructured or marker.

After extraction, we clean: collapse whitespace, drop headers/footers, strip page numbers.

Technical Breakdown

from pypdf import PdfReader

def pdf_to_pages(file_path):
    reader = PdfReader(file_path)
    pages = []
    for i, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        text = " ".join(text.split())  # collapse whitespace
        pages.append({"page_num": i, "text": text})
    return pages

Each page becomes a dict with text and page number. Critical for citations later.
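
The parser above only collapses whitespace. The other two cleaning steps (dropping headers/footers and stripping page numbers) could be sketched as follows. This is a heuristic, not a definitive cleaner: strip_repeated_lines and its min_share threshold are hypothetical names chosen for illustration, and the function needs the raw page text with line breaks still intact, so it would run before the whitespace collapse.

```python
import re
from collections import Counter

def strip_repeated_lines(raw_pages, min_share=0.6):
    """Drop lines that repeat as a page's first/last line, plus bare page numbers.

    raw_pages: list of page texts with newlines still present.
    Returns one cleaned, whitespace-collapsed string per page.
    """
    split = [[ln.strip() for ln in p.splitlines() if ln.strip()] for p in raw_pages]

    # Count how often each line shows up as the first or last line of a page.
    edge_counts = Counter()
    for lines in split:
        if lines:
            edge_counts.update({lines[0], lines[-1]})

    # A line seen at the edge of most pages is treated as a header or footer.
    threshold = max(2, int(len(raw_pages) * min_share))
    repeated = {ln for ln, n in edge_counts.items() if n >= threshold}

    # Lines like "7", "Page 7", or "Page 7 of 12". Note: this also drops
    # number-only lines that are real content (for example table cells).
    page_number = re.compile(r"^(page\s+)?\d+(\s+of\s+\d+)?$", re.IGNORECASE)

    cleaned = []
    for lines in split:
        kept = [ln for ln in lines if ln not in repeated and not page_number.match(ln)]
        cleaned.append(" ".join(" ".join(kept).split()))
    return cleaned
```

Headers that change on every page (for example ones that embed the page number) will slip through, so spot-check the output on your own PDFs.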

Visual Learning Suggestion

A 3-step "PDF -> raw text -> cleaned text" diagram with messy text on the left and tidy on the right.

Interactive Element

Run the parser on any PDF you have. Print the first 200 chars of pages 1, 5, and 10. Spot any garbage.

Hands-on Lab

Implement pdf_to_pages in rag.py. Test it on a 5-page PDF and a 50-page PDF.

Mini Exercise

What types of PDFs will this parser struggle with?

Common Mistakes

Concatenating all pages (loses page-level citations)
Skipping cleaning (junk tokens hurt retrieval)
Choosing a heavy parser for simple PDFs (slow)

Debugging Tips

If pages come back blank, your PDF may be scanned images. You need OCR (Tesseract or a vision LLM call).
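
A small check for this, assuming the page dicts produced by pdf_to_pages. The helper name pages_needing_ocr and the 20-character threshold are illustrative assumptions; tune the threshold on your own documents.

```python
def pages_needing_ocr(pages, min_chars=20):
    """Return the page numbers whose extracted text is empty or nearly empty."""
    return [p["page_num"] for p in pages if len(p["text"].strip()) < min_chars]
```

If most pages of a PDF end up in this list, the file is probably image-only and worth routing to OCR.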

Knowledge Check Questions

Why store page numbers per chunk?
When do you need OCR?
Why clean text before chunking?

Quiz Questions

A scanned PDF (image-only) needs: a) A better parser b) OCR c) Larger chunks d) A bigger LLM Answer: b

Challenge Task

Add a fallback that uses a vision LLM (GPT-4o or Gemini) to OCR pages where pypdf returns empty text.

Real-world Use Cases

Legal document QA
Textbook chatbots
Receipts and invoice extraction
Research paper assistants

Industry Insight

A common production pattern in 2026: teams mix pypdf for fast paths and vision LLMs for hard PDFs. It is cheaper than people think and the quality gap is closing.

Interview Questions

How do you parse a PDF reliably?
How do you handle scanned PDFs?
Why preserve page numbers?

Summary

Reliable extraction is the secret to good RAG. Pick pypdf, clean, preserve page numbers, escalate to OCR when needed.

Lesson 9.3: Chunking strategy that actually works

Hook / Why This Matters

Chunking is where RAG projects either succeed or quietly fail. We learn the defaults and the levers.

Beginner Analogy

You are tearing a book into recipe cards. Too small and each card is missing context. Too big and the card has too many recipes mixed. There is a sweet spot.

Concept Explanation

Defaults to start with:

chunk size: ~500 tokens (about 350 words)
chunk overlap: 50 to 80 tokens
prefer chunking by paragraph or sentence, not by character

Always preserve metadata per chunk: source filename, page number, chunk index, document title.

Technical Breakdown

Simple chunker:

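import tiktoken
# gpt-4o's tokenizer; text-embedding-3-small uses cl100k_base, so counts are approximate
enc = tiktoken.get_encoding("o200k_base")

def chunk_text(text, chunk_tokens=500, overlap=80):
    if not 0 <= overlap < chunk_tokens:
        raise ValueError("overlap must be smaller than chunk_tokens")
    tokens = enc.encode(text)
    chunks = []
    i = 0
    while i < len(tokens):
        j = min(i + chunk_tokens, len(tokens))
        chunks.append(enc.decode(tokens[i:j]))
        if j == len(tokens):
            break  # last chunk reached; avoids a trailing chunk that is pure overlap
        i += chunk_tokens - overlap  # step forward, keeping `overlap` tokens of context
    return chunks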
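
To see where the chunk boundaries fall without running a tokenizer, the same stepping logic can be sketched on token positions alone. chunk_spans is a hypothetical helper for illustration; it stops once a chunk reaches the end of the text.

```python
def chunk_spans(n_tokens, chunk_tokens=500, overlap=80):
    """Return the (start, end) token positions of each chunk."""
    if not 0 <= overlap < chunk_tokens:
        raise ValueError("overlap must be smaller than chunk_tokens")
    spans, start = [], 0
    while start < n_tokens:
        end = min(start + chunk_tokens, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the text is fully covered
        start += chunk_tokens - overlap  # stride = chunk size minus overlap
    return spans

print(chunk_spans(1200))  # [(0, 500), (420, 920), (840, 1200)]
```

With the defaults the stride is 500 - 80 = 420 tokens, so a 1,200-token page yields three chunks, and each neighboring pair shares 80 tokens.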

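Or use LangChain's RecursiveCharacterTextSplitter for paragraph-aware splitting. Its chunk_size and chunk_overlap are measured in characters by default, so 2000 and 300 correspond to roughly 500 and 75 tokens of English text. The newline separators only help if the text still contains line breaks, so skip the whitespace collapse in pdf_to_pages when you use it: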

from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000, chunk_overlap=300, separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_text(text)
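
If you would rather not add a dependency, sentence-aware chunking can also be sketched in plain Python. This is a rough illustration under stated assumptions: it counts words as a stand-in for tokens, uses a naive regex for sentence boundaries, and chunk_sentences with its parameters is a hypothetical helper, not part of any library.

```python
import re

def chunk_sentences(text, max_words=350, overlap_sentences=1):
    """Group whole sentences into chunks of roughly max_words words each."""
    if overlap_sentences < 0:
        raise ValueError("overlap_sentences must be >= 0")
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            # Carry the last sentence(s) forward so neighbors overlap.
            current = current[-overlap_sentences:] if overlap_sentences else []
            count = sum(len(s.split()) for s in current)
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The limit is soft: a single sentence longer than max_words still becomes its own oversized chunk rather than being cut mid-sentence.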

Visual Learning Suggestion

A bar of tokens with red brackets showing overlapping chunk boundaries. Below it, a "right chunk size" curve showing quality vs size.

Interactive Element

Take a 5-page PDF. Chunk with three settings: (200, 0), (500, 50), (1500, 200). Look at the chunks. Pick the one that "reads sensibly".

Hands-on Lab

Implement chunk_text in rag.py. Process your test PDFs. Print 3 random chunks. Sanity check.

Mini Exercise

Why does overlap prevent answers being cut in half across chunks?

Common Mistakes

Chunking by raw characters (breaks mid-word)
Zero overlap (loses cross-chunk answers)
Different chunk size per document (inconsistent retrieval)

Debugging Tips

If retrieved chunks frequently miss the answer, your chunks are too small or your overlap is too low.

Knowledge Check Questions

What is chunk overlap?
Why is paragraph-aware chunking better than character?
What metadata should every chunk carry?

Quiz Questions

A good starting chunk size for prose PDFs is around: a) 50 tokens b) 500 tokens c) 5,000 tokens d) Whole document Answer: b

Challenge Task

Add "semantic chunking": split on heading boundaries detected by regex. Compare retrieval quality.

Real-world Use Cases

All RAG ingests
Legal contract clause indexing
Code repo chunking by function

Industry Insight

An emerging practice in 2026: "agentic chunking", where an LLM proposes chunk boundaries based on meaning. Ingest is slower and pricier, in exchange for retrieval that is often noticeably better.

Interview Questions

How do you pick chunk size?
Why is overlap important?
What is agentic chunking?

Summary

500 tokens, 80 overlap, paragraph-aware, with metadata. Adjust later based on eval results.

Lesson 9.4: Building the vector index

Hook / Why This Matters

This is the lesson where the index goes from idea to a queryable database on disk.

Beginner Analogy

You are now writing all the index cards from the prior lesson into a card catalog drawer, sorted by meaning.

Concept Explanation

For each chunk: compute embedding, store in Chroma with the chunk text and metadata. Reuse the persistent client so the index survives restarts.

Technical Breakdown

import chromadb
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()

openai_client = OpenAI()
chroma_client = chromadb.PersistentClient(path="./chroma_db")
col = chroma_client.get_or_create_collection("pdfs")

def embed(text):
    return openai_client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

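def index_pdf(file_path):
    pages = pdf_to_pages(file_path)
    ids, docs, metas, embs = [], [], [], []
    for p in pages:
        chunks = chunk_text(p["text"])
        for i, c in enumerate(chunks):
            # Deterministic id: the same file, page, and chunk always map to the same row
            uid = f"{file_path}-p{p['page_num']}-c{i}"
            ids.append(uid)
            docs.append(c)
            metas.append({
                "source": file_path,
                "page": p["page_num"],
                "chunk_index": i,
            })
            embs.append(embed(c))
    if not ids:
        return 0  # no extractable text (e.g. a scanned PDF); Chroma rejects empty batches
    # upsert overwrites rows whose ids already exist, so re-ingesting a file is safe
    col.upsert(ids=ids, documents=docs, embeddings=embs, metadatas=metas)
    return len(ids)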

For larger PDFs, batch the embedding calls (input=[c1, c2, c3, ...]) for speed.
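
A minimal sketch of that batching, assuming an OpenAI-style client is passed in. embed_many and batched are hypothetical helper names, and the batch size of 64 is an arbitrary starting point rather than a recommended value.

```python
def batched(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_many(texts, client, model="text-embedding-3-small", batch_size=64):
    """Embed a list of texts using one API call per batch instead of one per text."""
    vectors = []
    for batch in batched(texts, batch_size):
        resp = client.embeddings.create(model=model, input=batch)
        # Each result carries the index of its input; sort to keep input order.
        ordered = sorted(resp.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return vectors
```

Inside index_pdf you would collect all chunk texts first, call embed_many(docs, openai_client) once, and pass the resulting list as the embeddings argument.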

Visual Learning Suggestion

A diagram of "PDF -> chunks list -> embed -> Chroma row inserts" with arrows.

Interactive Element

Index a 50-page PDF. Check Chroma's collection count. Confirm metadata is intact.

Hands-on Lab

Implement index_pdf in rag.py. Add a small CLI: python rag.py ingest data/file.pdf.
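
One way to wire up that CLI with argparse from the standard library. This is a sketch: build_parser and main are illustrative names, and the indexing function is passed in as index_fn so the snippet runs on its own. In rag.py you would call main(index_fn=index_pdf) under an if __name__ == "__main__": guard.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="rag.py", description="PDF chatbot index tools")
    sub = parser.add_subparsers(dest="command", required=True)
    ingest = sub.add_parser("ingest", help="index one or more PDFs")
    ingest.add_argument("paths", nargs="+", help="PDF files to index")
    return parser

def main(argv=None, index_fn=None):
    """Parse arguments (sys.argv when argv is None) and run the chosen command."""
    args = build_parser().parse_args(argv)
    if args.command == "ingest":
        for path in args.paths:
            n = index_fn(path)
            print(f"Indexed {n} chunks from {path}")
```

Using a subcommand leaves room for later additions such as a delete or stats command without changing the existing invocation.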

Mini Exercise

Why deduplicate ids?

Common Mistakes

Re-ingesting a chunk under an id that already exists (col.add errors or silently skips it, depending on the Chroma version; use col.upsert to overwrite)
Forgetting batching (slow for big PDFs)
Not handling embedding API rate limits

Debugging Tips

If ingest is slow, batch 20 to 100 chunks per embedding call. The API supports it.

Knowledge Check Questions

Why include chunk_index in metadata?
Why use a persistent Chroma client?
What is batched embedding?

Quiz Questions

A deterministic chunk id format is best because: a) It is shorter b) It allows re-ingest without duplicates c) It improves embedding quality d) It is required by Chroma Answer: b

Challenge Task

Add a delete_pdf(file_path) function that removes all chunks for a given source.

Real-world Use Cases

Multi-PDF chatbots
Continuous-ingest knowledge bases
Per-user document indices

Industry Insight

In production, chunk ids and metadata schemas are the highest-leverage design decisions in RAG. Get them right early.

Interview Questions

How do you design chunk ids?
How do you delete and re-ingest a document?
How do you handle large ingest jobs?

Summary

Embed each chunk, store with rich metadata, persist Chroma. Index card catalog complete.

Lesson 9.5: The retrieve-and-answer flow with citations

Hook / Why This Matters

This is the lesson where the chatbot actually answers. The trick is forcing citations so users trust the output.

Beginner Analogy

A research assistant who not only answers your question but tells you which book and page they got it from.

Concept Explanation

Flow per query:

Embed the user question.
Query Chroma for top-K (e.g., 5) chunks.
Build a prompt with system rules + retrieved chunks + the user question.
Tell the model to cite (filename + page) for each claim.
Stream the answer.

Technical Breakdown

SYSTEM = """You are a PDF chatbot. Use ONLY the provided sources to answer.
If the answer is not in the sources, say "I could not find that in the documents."
After each fact, cite the source like [filename p.X].
"""

def retrieve(question, k=5):
    q_emb = embed(question)
    res = col.query(query_embeddings=[q_emb], n_results=k)
    docs = res["documents"][0]
    metas = res["metadatas"][0]
    return list(zip(docs, metas))

def answer(question):
    """Generator: yields the answer piece by piece as the model streams it."""
    pairs = retrieve(question, k=5)
    context = "\n\n".join(
        f"[Source: {m['source']} p.{m['page']}]\n{d}" for d, m in pairs
    )
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {question}"},
    ]
    stream = openai_client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, stream=True
    )
    for chunk in stream:
        if not chunk.choices:
            continue  # some stream events carry no choices
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta  # callers join the pieces to get the full text

Visual Learning Suggestion

A 5-step flow with annotations: embed question, top-K chunks, prompt assembly, LLM call, stream out.

Interactive Element

Run the function on a small index. Ask 3 questions. Verify the citations are real.

Hands-on Lab

Implement retrieve and answer in rag.py. Try 5 questions. Note false citations if any.

Mini Exercise

Why does the system prompt explicitly tell the model to say it could not find the answer instead of guessing?

Common Mistakes

No "I do not know" instruction (the model will invent)
Forgetting to include the citation format in the system prompt
Top-K too high (noisy context, cost spikes) or too low (misses answer)

Debugging Tips

If citations look wrong, they probably are. Add a post-step that verifies the cited page actually contains the claim's keywords. This catches the worst hallucinations.
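
That post-step could be sketched as below. It assumes citations follow the [filename p.X] format from the system prompt and that the (chunk, metadata) pairs from retrieve are available. check_citations is a hypothetical helper, and the keyword test (shared words of four or more letters) is deliberately crude.

```python
import re

CITATION = re.compile(r"\[([^\[\]]+?) p\.(\d+)\]")

def _keywords(text):
    # Lowercased words of 4+ letters: a rough stand-in for "important terms".
    return set(re.findall(r"[a-z]{4,}", text.lower()))

def check_citations(answer_text, pairs, min_overlap=1):
    """Return (source, page, ok) for every citation found in answer_text."""
    # Gather the text that was actually retrieved for each (source, page).
    page_text = {}
    for doc, meta in pairs:
        key = (meta["source"], int(meta["page"]))
        page_text[key] = page_text.get(key, "") + " " + doc

    results, prev_end = [], 0
    for m in CITATION.finditer(answer_text):
        claim = answer_text[prev_end:m.start()]  # text since the previous citation
        prev_end = m.end()
        key = (m.group(1), int(m.group(2)))
        cited = page_text.get(key)
        # ok only if the page was retrieved AND shares keywords with the claim.
        ok = cited is not None and len(_keywords(claim) & _keywords(cited)) >= min_overlap
        results.append((key[0], key[1], ok))
    return results
```

A citation to a page that was never retrieved is flagged immediately, which tends to catch invented sources; paraphrased claims can still pass or fail incorrectly, so treat the result as a warning signal rather than proof.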

Knowledge Check Questions

Why include sources in the prompt with explicit tags?
Why instruct the model to refuse if missing?
What does K control?

Quiz Questions

To reduce hallucinations in RAG, the most reliable move is to: a) Use a larger LLM b) Instruct refusal when sources lack the answer c) Increase temperature d) Skip citations Answer: b

Challenge Task

Add a "show sources" expander in the UI that displays the actual retrieved chunks for each answer.

Real-world Use Cases

Document QA with audit trails
Compliance bots
Internal knowledge assistants

Industry Insight

In 2026 enterprise, citations are non-negotiable. Buyers reject any RAG product that cannot show its sources.

Interview Questions

How do you enforce citations in RAG?
How do you reduce hallucinations?
How do you set K?

Summary

Retrieve, augment, generate, cite. The four moves of a trustworthy RAG answer.

Lesson 9.6: Streamlit UI: upload, chat, citations

Hook / Why This Matters

Wiring up the UI is what turns code into a product anyone can use. Two hours and you have a real app.

Beginner Analogy

You wrote the engine. Now you bolt on the steering wheel, the seat, the dashboard.

Concept Explanation

UI pieces:

Sidebar PDF uploader (calls ingest)
Chat input and history
Streaming responses
Source citations expander

Technical Breakdown

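import os
import streamlit as st
from rag import index_pdf, answer

st.title("PDF Chatbot")

if "indexed" not in st.session_state:
    st.session_state.indexed = set()  # filenames already ingested this session

with st.sidebar:
    uploaded = st.file_uploader("Upload PDF", type=["pdf"])
    # Streamlit reruns this script on every interaction, so index each file only once
    if uploaded and uploaded.name not in st.session_state.indexed:
        os.makedirs("data", exist_ok=True)
        path = os.path.join("data", os.path.basename(uploaded.name))
        with open(path, "wb") as f:
            f.write(uploaded.getvalue())
        with st.spinner("Indexing..."):
            n = index_pdf(path)
        st.session_state.indexed.add(uploaded.name)
        st.success(f"Indexed {n} chunks from {uploaded.name}")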

if "messages" not in st.session_state:
    st.session_state.messages = []

for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

if prompt := st.chat_input("Ask about your PDFs"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    with st.chat_message("assistant"):
        placeholder = st.empty()
        text = ""
        for delta in answer(prompt):
            text += delta
            placeholder.markdown(text + "_")
        placeholder.markdown(text)
        st.session_state.messages.append({"role": "assistant", "content": text})

That is the whole app.

Visual Learning Suggestion

Annotated screenshot of the final UI with arrows pointing to uploader, chat input, chat history, and citations.

Interactive Element

Run it. Upload your own PDF. Ask 3 questions. Take a screenshot.

Hands-on Lab

Build the UI. Upload a PDF of your choice (your resume, a textbook chapter). Ask 5 questions. Share the screenshot in the GeekHub community.

Mini Exercise

Why is the placeholder pattern needed for streaming?

Common Mistakes

Forgetting os.makedirs("data", exist_ok=True)
Not showing ingest progress for large PDFs
Allowing huge file uploads (cap with server.maxUploadSize in .streamlit/config.toml; the default is 200 MB)

Debugging Tips

If users complain "it does not see my upload", check that ingest finished before they asked a question. Add an indicator.

Knowledge Check Questions

Why use st.session_state here?
What does the placeholder do?
How is the file persisted before ingest?

Quiz Questions

The sidebar uploader's main job is to: a) Show chat history b) Trigger ingest into the vector DB c) Render answers d) Set the API key Answer: b

Challenge Task

Add a "Clear all PDFs" button that wipes Chroma and clears uploads.

Real-world Use Cases

Document Q&A products
Internal knowledge bots
Education tools

Industry Insight

A clean, focused UI for a single use case beats a feature-rich messy app every time. Resist scope creep.

Interview Questions

Walk me through the data flow on a single user question.
How do you handle uploads safely?
How would you add multi-user separation?

Summary

Sidebar upload, chat input, streaming, citations. The whole app is one screen.

Lesson 9.7: Evaluation: knowing when RAG is "good enough"

Hook / Why This Matters

Without an eval set, you ship vibes. With one, you ship products. This lesson hands you the minimum viable eval.

Beginner Analogy

A chef who never tastes their food cannot improve. A RAG engineer who never evals their app is the same.

Concept Explanation

Build a 20-question eval set with known answers and expected source pages. After every change, re-run, compare. Track three metrics:

Answer correctness (manual or LLM-judge)
Source correctness (did it cite the right pages?)
Faithfulness (did it stick to the sources, or hallucinate?)

Technical Breakdown

import json
import re
from rag import answer

with open("eval.json") as f:
    EVAL = json.load(f)  # list of {q, expected_page, expected_kw}

for case in EVAL:
    out = "".join(answer(case["q"]))
    # (?!\d) stops "p.1" from matching inside "p.12"
    citation_ok = re.search(rf"p\.{case['expected_page']}(?!\d)", out) is not None
    keyword_ok = case["expected_kw"].lower() in out.lower()
    print(case["q"], citation_ok, keyword_ok)

For deeper eval, use Ragas, TruLens, or Promptfoo.
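
To compare runs after each change, it helps to reduce the per-case booleans to a few rates. A minimal sketch, assuming you collect each case's result as a dict with citation_ok and keyword_ok keys; summarize is a hypothetical helper name.

```python
def summarize(results):
    """Turn per-case booleans into pass rates between 0.0 and 1.0."""
    total = len(results)
    if total == 0:
        return {"cases": 0, "citation_rate": 0.0, "keyword_rate": 0.0, "both_rate": 0.0}
    citation = sum(r["citation_ok"] for r in results)
    keyword = sum(r["keyword_ok"] for r in results)
    both = sum(r["citation_ok"] and r["keyword_ok"] for r in results)
    return {
        "cases": total,
        "citation_rate": citation / total,
        "keyword_rate": keyword / total,
        "both_rate": both / total,
    }
```

Saving this summary next to a note about what changed (chunk size, K, model) gives you a simple history for attributing improvements.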

Visual Learning Suggestion

A table of eval results with green/red dots per case and a final score.

Interactive Element

Build 20 questions about a PDF you indexed. Score your current app.

Hands-on Lab

Build the eval script. Iterate one improvement (different chunk size, larger K, different LLM). Re-run. Compare.

Mini Exercise

Why is "faithfulness" measured separately from "correctness"?

Common Mistakes

Skipping eval ("looks fine to me" is not a metric)
Single-shot eval after one big change (cannot attribute improvement)
Letting evals get stale (re-add new failure cases as you find them)

Debugging Tips

When users report a bad answer, add that question to the eval set immediately. Your set grows with your product.

Knowledge Check Questions

Name 3 RAG eval metrics.
Why iterate one change at a time?
What is faithfulness?

Quiz Questions

The most important RAG eval metric for trust is: a) Latency b) Cost c) Faithfulness (not hallucinating) d) Token count Answer: c

Challenge Task

Build an LLM-as-judge that grades faithfulness on a 1-5 scale for 20 answers.

Real-world Use Cases

Pre-launch checks
Model upgrade regression
Continuous quality monitoring

Industry Insight

The fastest RAG career growth happens to engineers who own eval. Without it, all "improvements" are guesses.

Interview Questions

How do you evaluate a RAG system?
What is LLM-as-judge?
How do you prevent regression?

Summary

20-question eval set, three metrics, iterate one change at a time. This is what separates engineers from prompt-tinkerers.

Lesson 9.8: Deployment and stretch goals

Hook / Why This Matters

The capstone moment: ship the PDF chatbot live. Stretch goals turn it from "tutorial project" into "founder-able product".

Beginner Analogy

You built a working bicycle. Now you go ride it in public.

Concept Explanation

Deploy on Streamlit Cloud as in Module 6.6. Add secrets for OPENAI_API_KEY. Add a requirements.txt:

streamlit
openai
chromadb
pypdf
tiktoken
python-dotenv

Stretch goals:

Multi-PDF library with per-document selection
Hybrid search (vector + BM25)
Reranker (Cohere Rerank or Voyage Rerank)
Auth via Supabase
Persistent user history
Conversational follow-ups ("based on my last question...")
Image extraction from PDFs (vision LLM call)
Cost meter in the sidebar
"Suggested questions" generator from indexed content
Multi-language support
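
For the cost meter, the arithmetic is small enough to sketch here. Prices change, so this hypothetical estimate_cost helper takes them as arguments instead of hard-coding them; look up the current per-million-token rates for your models before using it.

```python
def estimate_cost(prompt_tokens, completion_tokens, usd_per_1m_input, usd_per_1m_output):
    """Estimate the USD cost of one LLM call from its token counts.

    Prices are given per one million tokens and must be supplied by the caller.
    """
    total = prompt_tokens * usd_per_1m_input + completion_tokens * usd_per_1m_output
    return total / 1_000_000
```

Summing these estimates in st.session_state and showing the running total in the sidebar is one simple way to surface spend to the user.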

Technical Breakdown

Hybrid search example:

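from rank_bm25 import BM25Okapi
from rag import col, retrieve

docs = col.get()["documents"]  # every chunk text in the index (must not be empty)
bm25 = BM25Okapi([d.lower().split() for d in docs])

def hybrid(question, k=10):
    vec_top = [d for d, _ in retrieve(question, k=k)]  # semantic hits
    kw_top = bm25.get_top_n(question.lower().split(), docs, n=k)  # keyword hits
    # Merge both lists, vector hits first, dropping duplicate chunks.
    # A reranker (see the stretch goals) would reorder this merged list.
    return list(dict.fromkeys(vec_top + kw_top))[:k]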
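
One common way to merge the vector and keyword result lists is reciprocal rank fusion (RRF), which scores each chunk by its rank in every list it appears in. The sketch below is one option, not the only way to combine results. reciprocal_rank_fusion is an illustrative name, and c=60 is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(ranked_lists, k=10, c=60):
    """Merge several ranked lists of chunk texts into one list of up to k items."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc in enumerate(ranked, start=1):
            # A high rank in any list adds more; appearing in several lists adds up.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]

merged = reciprocal_rank_fusion([["a", "b", "c"], ["b", "d"]])
print(merged)  # ['b', 'a', 'd', 'c']
```

Here "b" wins because it appears in both lists, even though it is not first in the vector results. RRF only needs ranks, so it avoids having to normalize BM25 scores against embedding distances.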

Visual Learning Suggestion

A "v1 -> v2 -> v3" roadmap with the 10 stretch goals on the right side.

Interactive Element

Deploy. Share the URL. Get one friend to try and tell you the first thing they wished worked differently.

Hands-on Lab

Deploy. Pick one stretch goal. Ship it within a week. Update README.

Mini Exercise

What is the smallest stretch goal that adds the most user trust?

Common Mistakes

Deploying without ingest progress UI (users assume it broke)
Skipping cost meter (one user with a 1000-page PDF can shock you)
Picking 4 stretch goals at once (none ship)

Debugging Tips

If your deployed app fails on large PDFs, you may be exceeding Streamlit Cloud memory. Switch to Hugging Face Spaces (more RAM) or a paid tier.

Knowledge Check Questions

Why a cost meter?
Why hybrid search?
Why a reranker?

Quiz Questions

The single highest-trust stretch goal is usually: a) Custom CSS b) Reliable citations and source view c) Multi-PDF d) Auth Answer: b

Challenge Task

Ship 3 stretch goals over 3 weekends. Document each in a separate PR.

Real-world Use Cases

Founder MVPs in legal, education, support
Internal company tools
Public-facing AI utilities

Industry Insight

A polished deployed RAG product on your GitHub is hireable on its own in 2026. Many junior AI roles ask for exactly this artifact.

Interview Questions

Walk me through your PDF chatbot architecture.
What was the hardest bug?
How would you scale this to 1000 concurrent users?

Summary

Deploy v1. Add one stretch at a time. The capstone of this course lives here.

Module 9 Recap

You shipped a real RAG-powered PDF chatbot. Citations work. Deploy is live. Your GitHub now has a project that most bootcamp portfolios lack.

Next Module

Module 10: Deploying AI Apps