Module 9: Building a PDF Chatbot (RAG Project), Lesson 9.4 (4 of 8 in this module)

Building the vector index

This is the lesson where the index goes from idea to a queryable database on disk.

You are now writing all the index cards from the prior lesson into a card catalog drawer, sorted by meaning.

For each chunk, compute an embedding and store it in Chroma together with the chunk text and its metadata. Reuse the persistent client so the index survives restarts.

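import chromadb
from openai import OpenAI
from dotenv import load_dotenv

# pdf_to_pages() and chunk_text() are the helpers from the earlier lessons;
# they must be defined in, or imported into, this file.

load_dotenv()  # reads .env so OpenAI() can find OPENAI_API_KEY

openai_client = OpenAI()
# PersistentClient writes to disk, so the index survives restarts
chroma_client = chromadb.PersistentClient(path="./chroma_db")
col = chroma_client.get_or_create_collection("pdfs")

def embed(text):
    """Return the embedding vector for a single string."""
    return openai_client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

def index_pdf(file_path):
    """Chunk, embed, and store every page of a PDF; return the chunk count."""
    pages = pdf_to_pages(file_path)
    ids, docs, metas, embs = [], [], [], []
    for p in pages:
        chunks = chunk_text(p["text"])
        for i, c in enumerate(chunks):
            # Deterministic id: the same file, page, and chunk position
            # always produce the same id
            uid = f"{file_path}-p{p['page_num']}-c{i}"
            ids.append(uid)
            docs.append(c)
            metas.append({
                "source": file_path,
                "page": p["page_num"],
                "chunk_index": i,
            })
            embs.append(embed(c))  # one API call per chunk
    if not ids:
        # No text was extracted (for example, a scanned PDF); nothing to write
        return 0
    col.add(ids=ids, documents=docs, embeddings=embs, metadatas=metas)
    return len(ids)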

For larger PDFs, batch the embedding calls: pass a list of chunks (input=[c1, c2, c3, ...]) so that one request embeds many chunks instead of one.
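A minimal sketch of the batching idea, assuming the OpenAI client from above is passed in as client. The names batched and embed_many and the default batch size of 64 are illustrative choices, not part of the lesson's code.

```python
def batched(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_many(client, texts, model="text-embedding-3-small", batch_size=64):
    """Embed a list of strings with one API request per batch.

    `client` is assumed to be an OpenAI() instance; the result is one vector
    per input string, in the same order as `texts`.
    """
    vectors = []
    for batch in batched(texts, batch_size):
        resp = client.embeddings.create(model=model, input=batch)
        # Each returned item carries the index of its input; sort to be safe
        ordered = sorted(resp.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return vectors
```

Inside index_pdf, one way to use this is to collect every chunk first and call embed_many(openai_client, docs) once after the loops, instead of calling embed(c) per chunk.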

Visualize it

A diagram of "PDF -> chunks list -> embed -> Chroma row inserts" with arrows.

Try it now

Index a 50-page PDF. Check Chroma's collection count. Confirm metadata is intact.
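One possible way to run this check, sketched under the assumption that col is the Chroma collection created above. The helper check_index and the REQUIRED_KEYS set are hypothetical names; the key set simply mirrors the three metadata fields written by index_pdf.

```python
# Metadata keys that index_pdf writes for every chunk
REQUIRED_KEYS = {"source", "page", "chunk_index"}

def check_index(col, sample_size=5):
    """Return (total_chunks, problems) for a Chroma collection.

    `problems` lists (id, missing_keys) for sampled records whose metadata
    lacks one of the expected keys.
    """
    total = col.count()
    sample = col.get(limit=sample_size, include=["metadatas"])
    problems = []
    for uid, meta in zip(sample["ids"], sample["metadatas"]):
        missing = REQUIRED_KEYS - set(meta or {})
        if missing:
            problems.append((uid, sorted(missing)))
    return total, problems
```

After ingesting, total should equal the number returned by index_pdf (for a fresh collection), and problems should be an empty list.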

Hands-on lab

Implement index_pdf in rag.py. Add a small CLI: python rag.py ingest data/file.pdf.
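A small sketch of what the CLI could look like, using argparse from the standard library. The function names build_parser and main are illustrative, and main takes the ingest function as a parameter (an assumption made here so the parsing logic can be exercised without a real index).

```python
import argparse

def build_parser():
    """Build a parser that understands: rag.py ingest <pdf_path>."""
    parser = argparse.ArgumentParser(prog="rag.py", description="PDF chatbot tools")
    sub = parser.add_subparsers(dest="command", required=True)
    ingest = sub.add_parser("ingest", help="index one PDF into Chroma")
    ingest.add_argument("pdf_path", help="path to the PDF file")
    return parser

def main(argv, ingest_fn):
    """Parse `argv` and dispatch; `ingest_fn` is expected to be index_pdf."""
    args = build_parser().parse_args(argv)
    if args.command == "ingest":
        count = ingest_fn(args.pdf_path)
        print(f"Indexed {count} chunks from {args.pdf_path}")
        return count
```

At the bottom of rag.py this would typically be wired up inside an if __name__ == "__main__": guard by calling main(sys.argv[1:], index_pdf), with sys imported at the top of the file.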

Try it now

Why deduplicate ids?

Common mistakes

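  • Re-embedding the same chunk twice (wasted API calls; depending on the Chroma version and on whether you call add or upsert, a duplicate id is rejected, skipped, or overwrites the stored record)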
  • Forgetting batching (slow for big PDFs)
  • Not handling embedding API rate limits
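For the rate-limit item, a common approach is to retry with exponential backoff. The sketch below is generic: with_retries, its parameters, and the delay values are illustrative assumptions, not something the lesson prescribes.

```python
import random
import time

def with_retries(fn, retry_on=(Exception,), max_attempts=5,
                 base_delay=1.0, sleep=time.sleep):
    """Call fn() and retry on the given exceptions with exponential backoff.

    The wait doubles after each failure (base_delay, 2x, 4x, ...) plus a
    little random jitter. The last failure is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

With the openai package (version 1 or later), a call could look like with_retries(lambda: embed(c), retry_on=(openai.RateLimitError,)), so that only rate-limit errors are retried and other failures surface immediately.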

Debugging tip

If ingest is slow, batch 20 to 100 chunks per embedding call. The embeddings endpoint accepts a list of strings as input, so one request can return many vectors.

Challenge

Add a delete_pdf(file_path) function that removes all chunks for a given source.
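If you get stuck, here is one possible shape for the solution. It relies on the "source" metadata field written by index_pdf and on Chroma's where filter. Unlike the signature in the challenge, this sketch takes the collection as an explicit first argument, which is an assumption made here to keep it easy to test.

```python
def delete_pdf(col, file_path):
    """Remove every chunk whose metadata "source" equals file_path.

    `col` is assumed to be the Chroma collection used by index_pdf.
    Returns the number of chunks removed.
    """
    # include=[] asks for ids only, without documents or embeddings
    matches = col.get(where={"source": file_path}, include=[])
    ids = matches["ids"]
    if ids:
        col.delete(ids=ids)
    return len(ids)
```

Looking up the ids first makes it possible to report how many chunks were removed, and it avoids calling delete when the source was never indexed.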

Where this shows up

  • Multi-PDF chatbots
  • Continuous-ingest knowledge bases
  • Per-user document indices

From the field

In production, chunk ids and metadata schemas are the highest-leverage design decisions in RAG. Get them right early.
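As one illustration of an id design choice: the lesson's ids start with file_path, so renaming or moving a file produces a new set of ids for the same content. A possible alternative, sketched here as an assumption rather than the lesson's method, is to derive the prefix from a hash of the file's bytes. The names file_fingerprint and chunk_id are hypothetical.

```python
import hashlib
from pathlib import Path

def file_fingerprint(path, length=12):
    """Return a short, stable hex prefix derived from the file's contents."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return digest[:length]

def chunk_id(fingerprint, page_num, chunk_index):
    """Build an id in the same shape as the lesson's: <prefix>-p<page>-c<chunk>."""
    return f"{fingerprint}-p{page_num}-c{chunk_index}"
```

The trade-off is that an edited PDF gets a new fingerprint, so its old chunks must be deleted explicitly; keeping the original path in the "source" metadata field makes that lookup possible.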

Recap

Embed each chunk, store with rich metadata, persist Chroma. Index card catalog complete.


Quick recall

Why include `chunk_index` in metadata?

Quiz time

  1. A deterministic chunk id format is best because
