Building the vector index
This is the lesson where the index goes from idea to a queryable database on disk.
You are now writing all the index cards from the prior lesson into a card catalog drawer, sorted by meaning.
For each chunk, compute its embedding and store it in Chroma together with the chunk text and metadata. Reuse the persistent client so the index survives restarts.
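import chromadb
from openai import OpenAI
from dotenv import load_dotenv

# Read environment variables (e.g. OPENAI_API_KEY) from a .env file.
load_dotenv()

openai_client = OpenAI()
# PersistentClient writes the index to disk so it survives restarts.
chroma_client = chromadb.PersistentClient(path="./chroma_db")
col = chroma_client.get_or_create_collection("pdfs")

def embed(text):
    """Return the embedding vector for a single string."""
    return openai_client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

def index_pdf(file_path):
    """Embed every chunk of the PDF, store it in Chroma, and return the chunk count."""
    # pdf_to_pages and chunk_text are assumed to be defined in earlier lessons.
    pages = pdf_to_pages(file_path)
    ids, docs, metas, embs = [], [], [], []
    for p in pages:
        chunks = chunk_text(p["text"])
        for i, c in enumerate(chunks):
            # Deterministic id: same file, page, and chunk position give the same id.
            uid = f"{file_path}-p{p['page_num']}-c{i}"
            ids.append(uid)
            docs.append(c)
            metas.append({
                "source": file_path,
                "page": p["page_num"],
                "chunk_index": i,
            })
            embs.append(embed(c))
    if not ids:
        # Nothing to index; avoid calling col.add with empty lists.
        return 0
    col.add(ids=ids, documents=docs, embeddings=embs, metadatas=metas)
    return len(ids)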
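To make the id scheme in index_pdf easier to see, here is a minimal sketch that pulls it into a helper. The name chunk_id is hypothetical and not part of the lesson's code; the format string is the one used above.

```python
def chunk_id(file_path, page_num, chunk_index):
    """Build the same id string that index_pdf uses for one chunk.

    Hypothetical helper for illustration: the id depends only on the
    file path, the page number, and the chunk's position on that page.
    """
    return f"{file_path}-p{page_num}-c{chunk_index}"


# The same inputs always produce the same id, so indexing a file twice
# yields identical ids instead of fresh random ones.
first_run = chunk_id("report.pdf", 3, 0)
second_run = chunk_id("report.pdf", 3, 0)
print(first_run)                 # report.pdf-p3-c0
print(first_run == second_run)   # True
```

Because the ids repeat across runs, a re-index can target existing entries rather than pile up duplicates. Chroma collections also expose an upsert method that takes the same arguments as add; using it when re-indexing is a suggestion here, not something the lesson's code does.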
For larger PDFs, batch the embedding calls by passing a list (input=[c1, c2, c3, ...]) so each request embeds many chunks instead of one.
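One way the batching could look is sketched below. The names batched and embed_many are hypothetical, the client argument is assumed to be an OpenAI client like openai_client above, and batch_size=100 is an illustrative value rather than a documented limit.

```python
def batched(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_many(client, texts, model="text-embedding-3-small", batch_size=100):
    """Embed `texts` with one request per batch instead of one per chunk.

    `client` is assumed to expose `embeddings.create(model=..., input=[...])`
    as the OpenAI client does. Returns one vector per input text, in order.
    """
    vectors = []
    for batch in batched(texts, batch_size):
        response = client.embeddings.create(model=model, input=batch)
        # Each result carries the index of its input; sort on it so the
        # vectors line up with the order of `batch`.
        ordered = sorted(response.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return vectors
```

In index_pdf, this would mean collecting all chunk texts first and then calling embed_many(openai_client, docs) once, instead of calling embed(c) inside the loop.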
Quick recall
Why include `chunk_index` in metadata?
Quiz time
1. A deterministic chunk id format is best because