Module 9: Building a PDF Chatbot (RAG Project)
Module Goal
Ship the flagship project of this course: a working "chat with any PDF" app that uses real RAG, retrieves citations, and lives at a public URL. By the end you can rebuild this in any future job interview from memory.
Estimated Duration
5 to 7 hours.
Skills Learned
- Parsing and chunking PDFs
- Building a persistent vector index
- End-to-end RAG with citations
- Streamlit UI for upload + chat
- Evaluating and improving RAG quality
- Free deployment with secrets
Real-world Importance
PDF chatbots are among the most requested AI features in 2026: legal teams, students, founders, and support orgs all want them. Being able to build one to a professional standard is a hireable skill on its own.
Lessons in this module
- Project setup and architecture review
- Parsing PDFs into clean text
- Chunking strategy that actually works
- Building the vector index
- The retrieve-and-answer flow with citations
- Streamlit UI: upload, chat, citations
- Evaluation: knowing when RAG is "good enough"
- Deployment and stretch goals
Lesson 9.1: Project setup and architecture review
Hook / Why This Matters
Five minutes of planning saves five hours of refactoring. We sketch first, code second.
Beginner Analogy
Before building a kitchen, you draw it. Before building a RAG app, you sketch the pipeline.
Concept Explanation
The architecture (drawn in Module 7.5):
PDFs -> text -> chunks -> embeddings -> ChromaDB
|
user question -> embed -> top-K -> prompt with chunks -> LLM -> answer + citations
Decisions to lock in upfront:
- Embedding model: text-embedding-3-small (cheap, capable)
- LLM: gpt-4o-mini (cheap, capable)
- Vector DB: ChromaDB (local, persistent)
- UI: Streamlit
- Citations: include source filename and page number
Technical Breakdown
Project structure:
pdf-chatbot/
    app.py              # Streamlit UI
    rag.py              # ingest + retrieve helpers
    prompts.py          # system prompt + answer template
    data/               # uploaded PDFs (gitignored)
    chroma_db/          # vector store (gitignored)
    requirements.txt
    .env                # API keys (gitignored)
    .gitignore
    README.md
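If you would rather script the skeleton than create it by hand, the sketch below is one way to do it. The `scaffold` helper and the `GITIGNORE_LINES` constant are hypothetical names, not part of the course code; the file and folder names come from the tree above.

```python
from pathlib import Path

# Paths that must never reach GitHub: uploads, the vector store, and API keys.
GITIGNORE_LINES = ["data/", "chroma_db/", ".env"]


# Hypothetical helper, not part of the course code.
def scaffold(root):
    """Create the project skeleton under `root` and return the root path."""
    root = Path(root)
    for folder in ("data", "chroma_db"):
        (root / folder).mkdir(parents=True, exist_ok=True)
    for name in ("app.py", "rag.py", "prompts.py", "requirements.txt", ".env", "README.md"):
        (root / name).touch(exist_ok=True)  # creates the file if missing, keeps existing content
    (root / ".gitignore").write_text("\n".join(GITIGNORE_LINES) + "\n")
    return root
```

Calling `scaffold("pdf-chatbot")` from an empty working directory should give you the layout above, ready for `git init`.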
Visual Learning Suggestion
A file-tree diagram next to a runtime architecture diagram. Side by side. Wiring becomes obvious.
Interactive Element
Create the folder structure. Run git init, write .gitignore, push an empty repo. 10 minutes.
Hands-on Lab
Set up the project skeleton above. Create empty stub functions in rag.py. Commit.
Mini Exercise
Why is splitting app.py from rag.py worth it even for a small project?
Common Mistakes
- One giant app.py with everything mixed (impossible to test)
- Skipping .gitignore (you will leak chroma_db or .env)
- No README.md (cannot show recruiters)
Debugging Tips
If you cannot draw and explain your structure in 60 seconds, restructure now. It is faster than later.
Knowledge Check Questions
- Why split UI from RAG logic?
- What files should be gitignored?
- Why pick the embedding model upfront?
Quiz Questions
- The vector DB folder should be:
a) Committed to GitHub
b) Gitignored
c) Renamed
d) In /tmp
Answer: b
Challenge Task
Write the README's "Architecture" section in 200 words with a Markdown diagram.
Real-world Use Cases
- All production AI apps
- Demo projects for jobs
- Open-source RAG starters
Industry Insight
A clean file structure on a GitHub repo is the first signal recruiters scan. Most beginners ship a mess. You will not.
Interview Questions
- Walk me through your project structure.
- Why this split of modules?
- How would you scale this to multi-user?
Summary
Sketch, scaffold, gitignore, README. Then code.
Lesson 9.2: Parsing PDFs into clean text
Hook / Why This Matters
Many RAG project failures start at the PDF parser. Bad text in = bad retrieval out. We pick a reliable parser and learn to clean.
Beginner Analogy
A blurry book scan vs a crisp ebook. You can read both, but only one is searchable. PDFs need to become the ebook.
Concept Explanation
We will use pypdf (lightweight) or pymupdf (better quality). For tables and scanned PDFs, consider unstructured or marker.
After extraction, we clean: collapse whitespace, drop headers/footers, strip page numbers.
Technical Breakdown
from pypdf import PdfReader

def pdf_to_pages(file_path):
    reader = PdfReader(file_path)
    pages = []
    for i, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        text = " ".join(text.split())  # collapse whitespace
        pages.append({"page_num": i, "text": text})
    return pages
Each page becomes a dict with text and page number. Critical for citations later.
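The cleaning step from the Concept Explanation (drop headers/footers, strip page numbers, collapse whitespace) can be sketched as follows. This is a minimal illustration, not the course's implementation: `clean_pages` is a hypothetical helper, the 60% repeat threshold is an assumption, and it expects raw page text with line breaks still present, so it would run before whitespace is collapsed.

```python
import re
from collections import Counter


# Hypothetical helper: expects raw page text with line breaks still present,
# i.e. before whitespace is collapsed.
def clean_pages(raw_pages, min_repeat=0.6):
    """Drop repeated header/footer lines and bare page numbers, then collapse whitespace."""
    split = [[ln.strip() for ln in page.splitlines() if ln.strip()] for page in raw_pages]
    # A line that appears on most pages is probably a running header or footer.
    counts = Counter(ln for lines in split for ln in set(lines))
    threshold = max(2, int(min_repeat * len(raw_pages)))
    boilerplate = {ln for ln, n in counts.items() if n >= threshold}
    # Matches lines such as "7", "Page 7", "7/12", or "Page 7 of 12".
    page_number = re.compile(r"^(page\s+)?\d+(\s*(/|of)\s*\d+)?$", re.IGNORECASE)
    cleaned = []
    for lines in split:
        kept = [ln for ln in lines if ln not in boilerplate and not page_number.match(ln)]
        cleaned.append(" ".join(" ".join(kept).split()))
    return cleaned
```

The heuristic is deliberately crude: a short document may not repeat its header often enough to trigger the threshold, so spot-check the output on your own PDFs.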
Visual Learning Suggestion
A 3-step "PDF -> raw text -> cleaned text" diagram with messy text on the left and tidy on the right.
Interactive Element
Run the parser on any PDF you have. Print the first 200 chars of pages 1, 5, and 10. Spot any garbage.
Hands-on Lab
Implement pdf_to_pages in rag.py. Test it on a 5-page PDF and a 50-page PDF.
Mini Exercise
What types of PDFs will this parser struggle with?
Common Mistakes
- Concatenating all pages (loses page-level citations)
- Skipping cleaning (junk tokens hurt retrieval)
- Choosing a heavy parser for simple PDFs (slow)
Debugging Tips
If pages come back blank, your PDF may be scanned images. You need OCR (Tesseract or a vision LLM call).
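A quick way to find the pages that need OCR is to flag those whose extracted text is empty or nearly so. The helper name and the 20-character threshold below are assumptions chosen for illustration.

```python
# Hypothetical helper; `pages` is the list returned by pdf_to_pages.
def pages_needing_ocr(pages, min_chars=20):
    """Return page numbers whose extracted text is empty or suspiciously short."""
    return [p["page_num"] for p in pages if len(p["text"].strip()) < min_chars]
```

If most pages are flagged, the whole file is probably a scan; if only a few are, they may be full-page figures.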
Knowledge Check Questions
- Why store page numbers per chunk?
- When do you need OCR?
- Why clean text before chunking?
Quiz Questions
- A scanned PDF (image-only) needs:
a) A better parser
b) OCR
c) Larger chunks
d) A bigger LLM
Answer: b
Challenge Task
Add a fallback that uses a vision LLM (GPT-4o or Gemini) to OCR pages where pypdf returns empty text.
Real-world Use Cases
- Legal document QA
- Textbook chatbots
- Receipts and invoice extraction
- Research paper assistants
Industry Insight
The 2026 production trend: most teams now mix pypdf for fast paths and vision LLMs for hard PDFs. It is cheaper than people think and the quality gap is closing.
Interview Questions
- How do you parse a PDF reliably?
- How do you handle scanned PDFs?
- Why preserve page numbers?
Summary
Reliable extraction is the secret to good RAG. Pick pypdf, clean, preserve page numbers, escalate to OCR when needed.
Lesson 9.3: Chunking strategy that actually works
Hook / Why This Matters
Chunking is where RAG projects either succeed or quietly fail. We learn the defaults and the levers.
Beginner Analogy
You are tearing a book into recipe cards. Too small and each card is missing context. Too big and the card has too many recipes mixed. There is a sweet spot.
Concept Explanation
Defaults to start with:
- chunk size: ~500 tokens (about 350 words)
- chunk overlap: 50 to 80 tokens
- prefer chunking by paragraph or sentence, not by character
Always preserve metadata per chunk: source filename, page number, chunk index, document title.
Technical Breakdown
Simple chunker:
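import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def chunk_text(text, chunk_tokens=500, overlap=80):
    if overlap >= chunk_tokens:
        raise ValueError("overlap must be smaller than chunk_tokens")  # otherwise the loop never advances
    tokens = enc.encode(text)
    chunks = []
    i = 0
    while i < len(tokens):
        j = min(i + chunk_tokens, len(tokens))
        chunks.append(enc.decode(tokens[i:j]))
        if j == len(tokens):
            break  # last window reached; avoids a trailing chunk that is pure overlap
        i += chunk_tokens - overlap  # step forward, keeping `overlap` tokens of context
    return chunks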
Or use LangChain's RecursiveCharacterTextSplitter for paragraph-aware splitting:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Sizes here are counted in characters, not tokens: 2000 characters is roughly 500 tokens of English.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000, chunk_overlap=300, separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_text(text)
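To combine sentence-aware splitting with the per-chunk metadata recommended above, something like the following could work. It is a dependency-free sketch: `chunk_page` is a hypothetical helper, word counts stand in for token counts (a rough proxy), and the overlap is expressed in whole sentences rather than tokens.

```python
import re


# Hypothetical helper. Word counts stand in for token counts here (a rough
# proxy); swap in a tokenizer if you need exact budgets.
def chunk_page(page, source, max_words=350, overlap_sentences=1):
    """Pack whole sentences into chunks and attach citation metadata to each."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", page["text"].strip()) if s]
    groups, current, words = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and words + n > max_words:
            groups.append(current)
            # Carry the last sentence(s) forward so context spans the boundary.
            current = current[-overlap_sentences:] if overlap_sentences else []
            words = sum(len(s.split()) for s in current)
        current.append(sent)
        words += n
    if current:
        groups.append(current)
    return [
        {"text": " ".join(g), "source": source, "page": page["page_num"], "chunk_index": i}
        for i, g in enumerate(groups)
    ]
```

Each returned dict carries everything a citation needs, so the chunk can be traced back to a file and page without extra lookups.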
Visual Learning Suggestion
A bar of tokens with red brackets showing overlapping chunk boundaries. Below it, a "right chunk size" curve showing quality vs size.
Interactive Element
Take a 5-page PDF. Chunk with three settings: (200, 0), (500, 50), (1500, 200). Look at the chunks. Pick the one that "reads sensibly".
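Before running the experiment, it can help to see how the three settings change the number of chunks. The sketch below computes sliding-window boundaries only, with no tokenizer; `window_bounds` is a hypothetical helper and the 3000-token document length is an assumed example. The window stops as soon as it reaches the end of the document.

```python
def window_bounds(n_tokens, chunk_tokens, overlap):
    """Return (start, end) token offsets for each sliding-window chunk."""
    if overlap >= chunk_tokens:
        raise ValueError("overlap must be smaller than chunk_tokens")
    bounds, start = [], 0
    while start < n_tokens:
        end = min(start + chunk_tokens, n_tokens)
        bounds.append((start, end))
        if end == n_tokens:
            break  # the window reached the end of the document
        start += chunk_tokens - overlap  # each step re-reads `overlap` tokens
    return bounds


# Assumed example: a 3000-token document under the three settings.
for size, overlap in [(200, 0), (500, 50), (1500, 200)]:
    print(size, overlap, len(window_bounds(3000, size, overlap)))
```

Smaller chunks mean more rows in the index and more embedding calls; larger chunks mean fewer, broader rows.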
Hands-on Lab
Implement chunk_text in rag.py. Process your test PDFs. Print 3 random chunks. Sanity check.
Mini Exercise
Why does overlap prevent answers being cut in half across chunks?
Common Mistakes
- Chunking by raw characters (breaks mid-word)
- Zero overlap (loses cross-chunk answers)
- Different chunk size per document (inconsistent retrieval)
Debugging Tips
If retrieved chunks frequently miss the answer, your chunks are too small or your overlap is too low.
Knowledge Check Questions
- What is chunk overlap?
- Why is paragraph-aware chunking better than character?
- What metadata should every chunk carry?
Quiz Questions
- A good starting chunk size for prose PDFs is around:
a) 50 tokens
b) 500 tokens
c) 5,000 tokens
d) Whole document
Answer: b
Challenge Task
Add "semantic chunking": split on heading boundaries detected by regex. Compare retrieval quality.
Real-world Use Cases
- All RAG ingests
- Legal contract clause indexing
- Code repo chunking by function
Industry Insight
The 2026 emerging best practice: "agentic chunking" where an LLM proposes chunk boundaries based on meaning. Slower and pricier ingest, dramatically better retrieval.
Interview Questions
- How do you pick chunk size?
- Why is overlap important?
- What is agentic chunking?
Summary
500 tokens, 80 overlap, paragraph-aware, with metadata. Adjust later based on eval results.
Lesson 9.4: Building the vector index
Hook / Why This Matters
This is the lesson where the index goes from idea to a queryable database on disk.
Beginner Analogy
You are now writing all the index cards from the prior lesson into a card catalog drawer, sorted by meaning.
Concept Explanation
For each chunk: compute embedding, store in Chroma with the chunk text and metadata. Reuse the persistent client so the index survives restarts.
Technical Breakdown
import chromadb
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()  # reads OPENAI_API_KEY from .env
openai_client = OpenAI()
chroma_client = chromadb.PersistentClient(path="./chroma_db")  # on disk, survives restarts
col = chroma_client.get_or_create_collection("pdfs")

def embed(text):
    return openai_client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

def index_pdf(file_path):
    pages = pdf_to_pages(file_path)
    ids, docs, metas, embs = [], [], [], []
    for p in pages:
        chunks = chunk_text(p["text"])
        for i, c in enumerate(chunks):
            # Deterministic id: the same file, page, and chunk always map to the same row.
            uid = f"{file_path}-p{p['page_num']}-c{i}"
            ids.append(uid)
            docs.append(c)
            metas.append({
                "source": file_path,
                "page": p["page_num"],
                "chunk_index": i,
            })
            embs.append(embed(c))
    if ids:  # scanned or empty PDFs produce no chunks
        # upsert overwrites rows whose ids already exist, so re-ingesting a file does not duplicate it
        col.upsert(ids=ids, documents=docs, embeddings=embs, metadatas=metas)
    return len(ids)
For larger PDFs, batch the embedding calls (input=[c1, c2, c3, ...]) for speed.
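A minimal sketch of that batching, kept independent of any API client so it can be tested offline. `batched` and `embed_in_batches` are hypothetical helpers, and `embed_many` is assumed to be any function that takes a list of strings and returns one vector per string in the same order.

```python
# Hypothetical helpers. `embed_many` is any function that takes a list of
# strings and returns one vector per string, in the same order.
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(texts, embed_many, batch_size=50):
    """Embed `texts` batch by batch and return the vectors in input order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_many(batch))
    return vectors
```

With the OpenAI client from this lesson, `embed_many` could plausibly be a small wrapper that calls `openai_client.embeddings.create(model="text-embedding-3-small", input=batch)` and returns `[d.embedding for d in response.data]`.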
Visual Learning Suggestion
A diagram of "PDF -> chunks list -> embed -> Chroma row inserts" with arrows.
Interactive Element
Index a 50-page PDF. Check Chroma's collection count. Confirm metadata is intact.
Hands-on Lab
Implement index_pdf in rag.py. Add a small CLI: python rag.py ingest data/file.pdf.
Mini Exercise
Why deduplicate ids?
Common Mistakes
- Re-embedding the same chunk twice (duplicate ids fail or replace)
- Forgetting batching (slow for big PDFs)
- Not handling embedding API rate limits
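One common way to handle rate limits is to retry with exponential backoff. The sketch below is generic: `with_retries` is a hypothetical helper, the delays are assumed defaults, and you pass in whichever exception type your client raises when it is rate limited.

```python
import random
import time


# Hypothetical helper. Pass the exception type(s) your client raises on rate limits.
def with_retries(fn, retry_on=(Exception,), max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

Usage might look like `with_retries(lambda: embed(chunk), retry_on=(openai.RateLimitError,))`, assuming a recent version of the OpenAI SDK, which exposes that exception class.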
Debugging Tips
If ingest is slow, batch 20 to 100 chunks per embedding call. The API supports it.
Knowledge Check Questions
- Why include chunk_index in metadata?
- Why use a persistent Chroma client?
- What is batched embedding?
Quiz Questions
- A deterministic chunk id format is best because:
a) It is shorter
b) It allows re-ingest without duplicates
c) It improves embedding quality
d) It is required by Chroma
Answer: b
Challenge Task
Add a delete_pdf(file_path) function that removes all chunks for a given source.
Real-world Use Cases
- Multi-PDF chatbots
- Continuous-ingest knowledge bases
- Per-user document indices
Industry Insight
In production, chunk ids and metadata schemas are the highest-leverage design decisions in RAG. Get them right early.
Interview Questions
- How do you design chunk ids?
- How do you delete and re-ingest a document?
- How do you handle large ingest jobs?
Summary
Embed each chunk, store with rich metadata, persist Chroma. Index card catalog complete.
Lesson 9.5: The retrieve-and-answer flow with citations
Hook / Why This Matters
This is the lesson where the chatbot actually answers. The trick is forcing citations so users trust the output.
Beginner Analogy
A research assistant who not only answers your question but tells you which book and page they got it from.
Concept Explanation
Flow per query:
- Embed the user question.
- Query Chroma for top-K (e.g., 5) chunks.
- Build a prompt with system rules + retrieved chunks + the user question.
- Tell the model to cite (filename + page) for each claim.
- Stream the answer.
Technical Breakdown
SYSTEM = """You are a PDF chatbot. Use ONLY the provided sources to answer.
If the answer is not in the sources, say "I could not find that in the documents."
After each fact, cite the source like [filename p.X].
"""

def retrieve(question, k=5):
    q_emb = embed(question)
    res = col.query(query_embeddings=[q_emb], n_results=k)
    docs = res["documents"][0]    # results for the first (and only) query
    metas = res["metadatas"][0]
    return list(zip(docs, metas))

def answer(question):
    """Generator: yields the answer piece by piece as it streams in."""
    pairs = retrieve(question, k=5)
    # Tag every chunk with its source and page so the model can cite it.
    context = "\n\n".join(
        f"[Source: {m['source']} p.{m['page']}]\n{d}" for d, m in pairs
    )
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {question}"},
    ]
    stream = openai_client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, stream=True
    )
    for chunk in stream:
        if not chunk.choices:
            continue  # some stream events carry no choices
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
Visual Learning Suggestion
A 5-step flow with annotations: embed question, top-K chunks, prompt assembly, LLM call, stream out.
Interactive Element
Run the function on a small index. Ask 3 questions. Verify the citations are real.
Hands-on Lab
Implement retrieve and answer in rag.py. Try 5 questions. Note false citations if any.
Mini Exercise
Why does the system prompt explicitly tell the model to say "I could not find that in the documents" instead of letting it guess?
Common Mistakes
- No "I do not know" instruction (the model will invent)
- Forgetting to include the citation format in the system prompt
- Top-K too high (noisy context, cost spikes) or too low (misses answer)
Debugging Tips
If citations look wrong, they probably are. Add a post-step that verifies the cited page actually contains the claim's keywords. This catches the worst hallucinations.
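That post-step could be sketched as below. It assumes answers cite in the `[filename p.X]` format from the system prompt and that `pairs` is the (chunk text, metadata) list returned by `retrieve`. The helper names, the four-letter keyword rule, and the 50% overlap threshold are all assumptions for illustration.

```python
import re

CITATION = re.compile(r"\[([^\[\]]+?)\s+p\.(\d+)\]")


def _keywords(text):
    # Crude keyword extraction: lowercase words of four or more letters.
    return set(re.findall(r"[a-z]{4,}", text.lower()))


# Hypothetical helper. `pairs` is the (chunk_text, metadata) list returned by retrieve().
def check_citations(answer_text, pairs, min_overlap=0.5):
    """Flag citations whose page was not retrieved or shares too few keywords with the claim."""
    page_text = {}
    for doc, meta in pairs:
        key = (meta["source"], int(meta["page"]))
        page_text[key] = page_text.get(key, "") + " " + doc
    results, prev_end = [], 0
    for match in CITATION.finditer(answer_text):
        claim = answer_text[prev_end:match.start()]  # text since the previous citation
        prev_end = match.end()
        key = (match.group(1).strip(), int(match.group(2)))
        claim_kw = _keywords(claim)
        found = claim_kw & _keywords(page_text.get(key, ""))
        overlap = len(found) / len(claim_kw) if claim_kw else 0.0
        results.append({"citation": key, "ok": key in page_text and overlap >= min_overlap})
    return results
```

Keyword overlap catches citations to pages that were never retrieved and claims that share no vocabulary with the cited page; it will not catch a subtle misreading, which is where an LLM judge or a human review comes in.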
Knowledge Check Questions
- Why include sources in the prompt with explicit tags?
- Why instruct the model to refuse if missing?
- What does K control?
Quiz Questions
- To reduce hallucinations in RAG, the most reliable move is to:
a) Use a larger LLM
b) Instruct refusal when sources lack the answer
c) Increase temperature
d) Skip citations
Answer: b
Challenge Task
Add a "show sources" expander in the UI that displays the actual retrieved chunks for each answer.
Real-world Use Cases
- Document QA with audit trails
- Compliance bots
- Internal knowledge assistants
Industry Insight
In 2026 enterprise, citations are non-negotiable. Buyers reject any RAG product that cannot show its sources.
Interview Questions
- How do you enforce citations in RAG?
- How do you reduce hallucinations?
- How do you set K?
Summary
Retrieve, augment, generate, cite. The four moves of a trustworthy RAG answer.
Lesson 9.6: Streamlit UI: upload, chat, citations
Hook / Why This Matters
Wiring up the UI is what turns code into a product anyone can use. Two hours and you have a real app.
Beginner Analogy
You wrote the engine. Now you bolt on the steering wheel, the seat, the dashboard.
Concept Explanation
UI pieces:
- Sidebar PDF uploader (calls ingest)
- Chat input and history
- Streaming responses
- Source citations expander
Technical Breakdown
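import os

import streamlit as st
from rag import index_pdf, answer

st.title("PDF Chatbot")
os.makedirs("data", exist_ok=True)  # data/ is gitignored, so create it at startup

# Streamlit reruns this script on every interaction; remember what is already indexed.
if "indexed" not in st.session_state:
    st.session_state.indexed = set()

with st.sidebar:
    uploaded = st.file_uploader("Upload PDF", type=["pdf"])
    if uploaded and uploaded.name not in st.session_state.indexed:
        path = os.path.join("data", os.path.basename(uploaded.name))
        with open(path, "wb") as f:
            f.write(uploaded.getvalue())
        with st.spinner("Indexing..."):
            n = index_pdf(path)
        st.session_state.indexed.add(uploaded.name)
        st.success(f"Indexed {n} chunks from {uploaded.name}")

if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far.
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

if prompt := st.chat_input("Ask about your PDFs"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)
    with st.chat_message("assistant"):
        placeholder = st.empty()  # one slot that is overwritten as text streams in
        text = ""
        for delta in answer(prompt):
            text += delta
            placeholder.markdown(text + "_")  # trailing "_" acts as a typing cursor
        placeholder.markdown(text)
    st.session_state.messages.append({"role": "assistant", "content": text})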
That is the whole app.
Visual Learning Suggestion
Annotated screenshot of the final UI with arrows pointing to uploader, chat input, chat history, and citations.
Interactive Element
Run it. Upload your own PDF. Ask 3 questions. Take a screenshot.
Hands-on Lab
Build the UI. Upload a PDF of your choice (your resume, a textbook chapter). Ask 5 questions. Share the screenshot in the GeekHub community.
Mini Exercise
Why is the placeholder pattern needed for streaming?
Common Mistakes
- Forgetting os.makedirs("data", exist_ok=True)
- Not showing ingest progress for large PDFs
- Allowing huge file uploads (cap with Streamlit config)
Debugging Tips
If users complain "it does not see my upload", check that ingest finished before they asked a question. Add an indicator.
Knowledge Check Questions
- Why use st.session_state here?
- What does the placeholder do?
- How is the file persisted before ingest?
Quiz Questions
- The sidebar uploader's main job is to:
a) Show chat history
b) Trigger ingest into the vector DB
c) Render answers
d) Set the API key
Answer: b
Challenge Task
Add a "Clear all PDFs" button that wipes Chroma and clears uploads.
Real-world Use Cases
- Document Q&A products
- Internal knowledge bots
- Education tools
Industry Insight
A clean, focused UI for a single use case beats a feature-rich messy app every time. Resist scope creep.
Interview Questions
- Walk me through the data flow on a single user question.
- How do you handle uploads safely?
- How would you add multi-user separation?
Summary
Sidebar upload, chat input, streaming, citations. The whole app is one screen.
Lesson 9.7: Evaluation: knowing when RAG is "good enough"
Hook / Why This Matters
Without an eval set, you ship vibes. With one, you ship products. This lesson hands you the minimum viable eval.
Beginner Analogy
A chef who never tastes their food cannot improve. A RAG engineer who never evals their app is the same.
Concept Explanation
Build a 20-question eval set with known answers and expected source pages. After every change, re-run, compare. Track three metrics:
- Answer correctness (manual or LLM-judge)
- Source correctness (did it cite the right pages?)
- Faithfulness (did it stick to the sources, or hallucinate?)
Technical Breakdown
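import json
import re

from rag import answer

# eval.json: a list of {"q": ..., "expected_page": ..., "expected_kw": ...}
with open("eval.json") as f:
    EVAL = json.load(f)

for case in EVAL:
    out = "".join(answer(case["q"]))  # answer() streams, so join the pieces
    # \b stops "p.1" from matching inside "p.12"
    citation_ok = re.search(rf"p\.{case['expected_page']}\b", out) is not None
    keyword_ok = case["expected_kw"].lower() in out.lower()
    print(case["q"], citation_ok, keyword_ok)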
For deeper eval, use Ragas, TruLens, or Promptfoo.
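Before reaching for those tools, a small scoring function that returns pass rates makes runs easy to compare. This is a sketch under assumptions: `run_eval` is a hypothetical helper, `answer_fn` is any function that maps a question to the full answer string, and the case fields match the `eval.json` shape used in this lesson.

```python
import re


# Hypothetical helper. `answer_fn` takes a question and returns the full answer string.
def run_eval(cases, answer_fn):
    """Score each case on citation and keyword hits; return per-case rows plus pass rates."""
    rows = []
    for case in cases:
        out = answer_fn(case["q"])
        # \b stops "p.1" from matching inside "p.12".
        citation_ok = re.search(rf"p\.{case['expected_page']}\b", out) is not None
        keyword_ok = case["expected_kw"].lower() in out.lower()
        rows.append({"q": case["q"], "citation_ok": citation_ok, "keyword_ok": keyword_ok})
    total = len(rows) or 1  # avoid dividing by zero on an empty set
    summary = {
        "citation_rate": sum(r["citation_ok"] for r in rows) / total,
        "keyword_rate": sum(r["keyword_ok"] for r in rows) / total,
    }
    return rows, summary
```

With the streaming `answer` generator from Lesson 9.5, the call would be something like `run_eval(EVAL, lambda q: "".join(answer(q)))`. Note that these two checks approximate source correctness and answer correctness only; faithfulness still needs a manual or LLM-judge pass.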
Visual Learning Suggestion
A table of eval results with green/red dots per case and a final score.
Interactive Element
Build 20 questions about a PDF you indexed. Score your current app.
Hands-on Lab
Build the eval script. Iterate one improvement (different chunk size, larger K, different LLM). Re-run. Compare.
Mini Exercise
Why is "faithfulness" measured separately from "correctness"?
Common Mistakes
- Skipping eval ("looks fine to me" is not a metric)
- Single-shot eval after one big change (cannot attribute improvement)
- Letting evals get stale (re-add new failure cases as you find them)
Debugging Tips
When users report a bad answer, add that question to the eval set immediately. Your set grows with your product.
Knowledge Check Questions
- Name 3 RAG eval metrics.
- Why iterate one change at a time?
- What is faithfulness?
Quiz Questions
- The most important RAG eval metric for trust is:
a) Latency
b) Cost
c) Faithfulness (not hallucinating)
d) Token count
Answer: c
Challenge Task
Build an LLM-as-judge that grades faithfulness on a 1-5 scale for 20 answers.
Real-world Use Cases
- Pre-launch checks
- Model upgrade regression
- Continuous quality monitoring
Industry Insight
The fastest RAG career growth happens to engineers who own eval. Without it, all "improvements" are guesses.
Interview Questions
- How do you evaluate a RAG system?
- What is LLM-as-judge?
- How do you prevent regression?
Summary
20-question eval set, three metrics, iterate one change at a time. This is what separates engineers from prompt-tinkerers.
Lesson 9.8: Deployment and stretch goals
Hook / Why This Matters
The capstone moment: ship the PDF chatbot live. Stretch goals turn it from "tutorial project" into "founder-able product".
Beginner Analogy
You built a working bicycle. Now you go ride it in public.
Concept Explanation
Deploy on Streamlit Cloud as in Module 6.6. Add secrets for OPENAI_API_KEY. Add a requirements.txt:
streamlit
openai
chromadb
pypdf
tiktoken
python-dotenv
Stretch goals:
- Multi-PDF library with per-document selection
- Hybrid search (vector + BM25)
- Reranker (Cohere Rerank or Voyage Rerank)
- Auth via Supabase
- Persistent user history
- Conversational follow-ups ("based on my last question...")
- Image extraction from PDFs (vision LLM call)
- Cost meter in the sidebar
- "Suggested questions" generator from indexed content
- Multi-language support
Technical Breakdown
Hybrid search example:
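from rank_bm25 import BM25Okapi

# all_chunks: every indexed chunk as a dict with a "text" key (load this from your index)
docs = [c["text"] for c in all_chunks]
bm25 = BM25Okapi([d.split() for d in docs])

def hybrid(question, k=10):
    vec_top = retrieve(question, k=k)  # (text, metadata) pairs from Chroma
    kw_top = bm25.get_top_n(question.split(), docs, n=k)  # plain text chunks from BM25
    # dedupe_and_rerank is yours to write: merge both lists, drop duplicates, reorder
    return dedupe_and_rerank(vec_top, kw_top)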
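The example leaves `dedupe_and_rerank` undefined. One simple option is reciprocal rank fusion (RRF), sketched below as a hypothetical stand-in; it is not necessarily what the course intends. It assumes both inputs are ranked lists of chunk texts, so the vector results would first be reduced to their text with something like `[d for d, _ in vec_top]`. The constant `c=60` is a commonly used default, not a tuned value.

```python
# Hypothetical stand-in for dedupe_and_rerank, using reciprocal rank fusion (RRF).
def rrf_merge(vec_top, kw_top, k=5, c=60):
    """Fuse two ranked lists of chunk texts; items ranked high in either list rise to the top."""
    scores = {}
    for ranking in (vec_top, kw_top):
        for rank, text in enumerate(ranking, start=1):
            # Each list contributes 1 / (c + rank); a chunk found in both lists gets both terms.
            scores[text] = scores.get(text, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

RRF only needs ranks, so it sidesteps the problem that BM25 scores and vector distances live on different scales.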
Visual Learning Suggestion
A "v1 -> v2 -> v3" roadmap with the 10 stretch goals on the right side.
Interactive Element
Deploy. Share the URL. Get one friend to try and tell you the first thing they wished worked differently.
Hands-on Lab
Deploy. Pick one stretch goal. Ship it within a week. Update README.
Mini Exercise
What is the smallest stretch goal that adds the most user trust?
Common Mistakes
- Deploying without ingest progress UI (users assume it broke)
- Skipping cost meter (one user with a 1000-page PDF can shock you)
- Picking 4 stretch goals at once (none ship)
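The arithmetic behind a cost meter is small enough to sketch. Both helpers below are hypothetical, the 500-tokens-per-page average is an assumption, and the price is left as a parameter because rates change; check your provider's current pricing before trusting any number.

```python
# Hypothetical helpers. The price is a parameter because rates change;
# look up the current figure for your model before relying on the estimate.
def estimate_cost_usd(n_tokens, usd_per_million_tokens):
    """Estimate spend for `n_tokens` at a given per-million-token price."""
    return n_tokens / 1_000_000 * usd_per_million_tokens


def pdf_token_estimate(n_pages, tokens_per_page=500):
    """Rough token count for a PDF; tokens_per_page is an assumed average."""
    return n_pages * tokens_per_page
```

Showing `estimate_cost_usd(pdf_token_estimate(n_pages), price)` in the sidebar before ingest gives users a chance to back out of a very large upload.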
Debugging Tips
If your deployed app fails on large PDFs, you may be exceeding Streamlit Cloud memory. Switch to Hugging Face Spaces (more RAM) or a paid tier.
Knowledge Check Questions
- Why a cost meter?
- Why hybrid search?
- Why a reranker?
Quiz Questions
- The single highest-trust stretch goal is usually:
a) Custom CSS
b) Reliable citations and source view
c) Multi-PDF
d) Auth
Answer: b
Challenge Task
Ship 3 stretch goals over 3 weekends. Document each in a separate PR.
Real-world Use Cases
- Founder MVPs in legal, education, support
- Internal company tools
- Public-facing AI utilities
Industry Insight
A polished deployed RAG product on your GitHub is hireable on its own in 2026. Many junior AI roles ask for exactly this artifact.
Interview Questions
- Walk me through your PDF chatbot architecture.
- What was the hardest bug?
- How would you scale this to 1000 concurrent users?
Summary
Deploy v1. Add one stretch at a time. The capstone of this course lives here.
Module 9 Recap
You shipped a real RAG-powered PDF chatbot. Citations work. The deployment is live. Your GitHub is now stronger than 90% of bootcamp grads.