Lesson 9.4: Building the vector index

This is the lesson where the index goes from idea to a queryable database on disk.

You are now writing all the index cards from the prior lesson into a card catalog drawer, sorted by meaning.

For each chunk, compute its embedding and store it in Chroma together with the chunk text and metadata. Reuse the persistent client so the index survives restarts.

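import chromadb
from dotenv import load_dotenv
from openai import OpenAI

# pdf_to_pages and chunk_text are assumed to be the helpers built in the
# earlier lessons and already defined in this file.

load_dotenv()  # loads OPENAI_API_KEY from .env into the environment

openai_client = OpenAI()
# PersistentClient writes to disk, so the index survives restarts.
chroma_client = chromadb.PersistentClient(path="./chroma_db")
col = chroma_client.get_or_create_collection("pdfs")

def embed(text):
    """Return the embedding vector for one piece of text."""
    return openai_client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

def index_pdf(file_path):
    """Chunk, embed, and store every page of a PDF; return the chunk count."""
    pages = pdf_to_pages(file_path)
    ids, docs, metas, embs = [], [], [], []
    for p in pages:
        chunks = chunk_text(p["text"])
        for i, c in enumerate(chunks):
            # Deterministic id: the same file, page, and chunk always map to the same id.
            uid = f"{file_path}-p{p['page_num']}-c{i}"
            ids.append(uid)
            docs.append(c)
            metas.append({
                "source": file_path,
                "page": p["page_num"],
                "chunk_index": i,
            })
            embs.append(embed(c))  # one API call per chunk
    if ids:  # col.add rejects empty lists, e.g. for a PDF with no extractable text
        col.add(ids=ids, documents=docs, embeddings=embs, metadatas=metas)
    return len(ids)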

index_pdf makes one embedding request per chunk. For larger PDFs, send several chunks per request (input=[c1, c2, c3, ...]) to cut the number of round trips.
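A minimal sketch of batched embedding, assuming the OpenAI client from the listing above is passed in as `client`. The helper name `embed_batch` and the default `batch_size` of 64 are illustrative choices, not part of the lesson's code.

```python
def embed_batch(client, texts, model="text-embedding-3-small", batch_size=64):
    """Embed texts in batches; return one vector per text, in input order."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Passing a list as `input` returns one embedding per item in a single request.
        resp = client.embeddings.create(model=model, input=batch)
        # Each result carries its position within the batch; sort on it to be safe.
        for item in sorted(resp.data, key=lambda d: d.index):
            vectors.append(item.embedding)
    return vectors
```

Inside index_pdf this could replace the per-chunk embed(c) call: collect ids, docs, and metas in the loop, then compute embs = embed_batch(openai_client, docs) once before writing to Chroma.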

Visualize it

A diagram of "PDF -> chunks list -> embed -> Chroma row inserts" with arrows.

Try it now

Index a 50-page PDF. Check Chroma's collection count. Confirm metadata is intact.
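One way to run this check, sketched under the assumption that `col` is the Chroma collection from the listing above. The function name `check_collection` and the `REQUIRED_KEYS` set are hypothetical; the keys mirror the metadata written by index_pdf.

```python
REQUIRED_KEYS = {"source", "page", "chunk_index"}

def check_collection(col, sample=5):
    """Return (total chunk count, ids in the sample whose metadata is incomplete)."""
    total = col.count()
    # get() without ids returns stored records; limit keeps the sample small.
    got = col.get(limit=sample, include=["metadatas"])
    bad_ids = [
        uid
        for uid, meta in zip(got["ids"], got["metadatas"])
        if not REQUIRED_KEYS <= set(meta or {})
    ]
    return total, bad_ids
```

After ingesting, total should equal the value index_pdf returned, and bad_ids should be empty.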

Hands-on lab

Implement index_pdf in rag.py. Add a small CLI: python rag.py ingest data/file.pdf.
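A possible shape for the CLI, using argparse from the standard library. This is a sketch, not the lab's reference solution: `build_parser` and `main` are illustrative names, and the indexing function is passed in as `index_fn` so the snippet runs on its own.

```python
import argparse

def build_parser():
    """Build a parser with a single 'ingest' subcommand."""
    parser = argparse.ArgumentParser(prog="rag.py")
    sub = parser.add_subparsers(dest="command", required=True)
    ingest = sub.add_parser("ingest", help="index a PDF into Chroma")
    ingest.add_argument("file_path", help="path to the PDF to index")
    return parser

def main(index_fn, argv=None):
    """Parse argv (defaults to sys.argv[1:]) and dispatch to the subcommand."""
    args = build_parser().parse_args(argv)
    if args.command == "ingest":
        count = index_fn(args.file_path)
        print(f"Indexed {count} chunks from {args.file_path}")
        return count
```

In rag.py, calling main(index_pdf) under an if __name__ == "__main__": guard would make python rag.py ingest data/file.pdf work as described.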

Try it now

Why deduplicate ids?

Common mistakes

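Re-embedding the same chunk twice (ingesting a PDF again reuses the same ids; depending on the Chroma version and method, the duplicates are rejected, ignored, or overwritten, and the embedding calls are wasted either way)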
Forgetting batching (slow for big PDFs)
Not handling embedding API rate limits
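For the rate-limit item, a common pattern is retrying with exponential backoff. The sketch below is generic: `with_backoff` is a hypothetical helper, and the exception type to retry on is passed in as `retry_on` rather than hard-coded.

```python
import random
import time

def with_backoff(fn, retry_on, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a retry_on exception, wait and retry with a doubled delay."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the original error
            # Exponential backoff plus a little jitter to avoid synchronized retries.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

With the OpenAI Python SDK (v1 and later), the exception to pass would be openai.RateLimitError, for example with_backoff(lambda: embed(c), retry_on=openai.RateLimitError).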

Debugging tip

If ingest is slow, batch 20 to 100 chunks per embedding call. The API supports it.

Challenge

Add a delete_pdf(file_path) function that removes all chunks for a given source.
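One possible starting point, if you get stuck: Chroma collections can delete by metadata filter. This sketch takes the collection as an extra `col` argument so it runs on its own, which differs from the delete_pdf(file_path) signature the challenge asks for; returning the number of removed chunks is also an added convenience.

```python
def delete_pdf(col, file_path):
    """Remove every chunk whose 'source' metadata equals file_path.

    Returns how many chunks were removed.
    """
    before = col.count()
    # The where filter matches on the metadata written at ingest time.
    col.delete(where={"source": file_path})
    return before - col.count()
```

This only works if "source" is stored exactly as it was at ingest time, which is one reason to keep the metadata schema stable.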

Where this shows up

Multi-PDF chatbots
Continuous-ingest knowledge bases
Per-user document indices

From the field

In production, chunk ids and metadata schemas are the highest-leverage design decisions in RAG. Get them right early.
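To make that concrete, here is a small sketch that keeps the id format and the metadata keys from this lesson in one place. The helper names `chunk_id` and `chunk_meta` are hypothetical, and normalizing the path with pathlib is an added assumption (so that "./data/a.pdf" and "data/a.pdf" produce the same id), not something the lesson's code does.

```python
from pathlib import Path

def _source(file_path):
    # Assumption: normalize the path so equivalent spellings share one source value.
    return Path(file_path).as_posix()

def chunk_id(file_path, page_num, chunk_index):
    """Deterministic id: the same file, page, and chunk always give the same string."""
    return f"{_source(file_path)}-p{page_num}-c{chunk_index}"

def chunk_meta(file_path, page_num, chunk_index):
    """Metadata with the same keys index_pdf writes: source, page, chunk_index."""
    return {
        "source": _source(file_path),
        "page": page_num,
        "chunk_index": chunk_index,
    }
```

Building ids and metadata through one pair of functions means ingest, delete, and any later filtering all agree on the same values.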

Recap

Embed each chunk, store with rich metadata, persist Chroma. Index card catalog complete.

Quick recall

Why include `chunk_index` in metadata?

Quiz time

1. A deterministic chunk id format is best because