Module 8: Vector Embeddings Simplified

Module Goal

Get fluent with vectors, embeddings, similarity, and the vector database choices you will face in Module 9. By the end, you can compare, ingest, and query embeddings confidently.

Estimated Duration

3 to 4 hours.

Skills Learned

Explaining embeddings without math
Computing cosine similarity
Picking an embedding model
Choosing a vector DB
Running your first embedding queries

Real-world Importance

Embeddings power RAG, semantic search, recommendations, classification, and clustering. Picking the right embedding model often improves a feature more than picking a fancier LLM.

Lessons in this module

What is an embedding, really?
Cosine similarity in 60 seconds
Embedding models in 2026: OpenAI, Voyage, Cohere, open source
Vector databases: when to use what
Your first embedding query in Python

Lesson 8.1: What is an embedding, really?

Hook / Why This Matters

Embeddings are the unsexy plumbing of AI. They run quietly behind ChatGPT, Google Search, Spotify recommendations, and every "find similar" feature you have ever used.

Beginner Analogy

Imagine assigning every word, sentence, or document an address on a giant map. Things with similar meaning live close together. Embeddings are the addresses.

Concept Explanation

A vector embedding is a list of numbers (e.g., 1,536 floats) that represents the meaning of a piece of text. Similar texts have nearby vectors. Different texts have far apart vectors.

Embeddings are produced by neural networks trained on huge corpora to project text into a meaning-aware coordinate system.
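
To make the map analogy concrete, here is a toy sketch with hand-picked 2D coordinates. The coordinates and the `nearest` helper are invented for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math

# Hypothetical 2D "addresses"; a real model would produce these, with many more dimensions.
toy_map = {
    "cat": (1.0, 1.2),
    "kitten": (1.1, 1.0),
    "car": (5.0, 0.5),
    "truck": (5.3, 0.8),
}

def nearest(word):
    """Return the other word whose address is closest (Euclidean distance)."""
    here = toy_map[word]
    others = (w for w in toy_map if w != word)
    return min(others, key=lambda w: math.dist(here, toy_map[w]))

print(nearest("cat"))    # kitten
print(nearest("truck"))  # car
```

Words with related meanings were placed close together by hand here. A trained model learns that placement from data.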

Technical Breakdown

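from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable
emb = client.embeddings.create(model="text-embedding-3-small", input="hello").data[0].embedding
print(len(emb), emb[:5])  # vector length, then the first five values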

You will get a list of 1,536 floats (the default size for this model). Two embeddings can be compared with cosine similarity.

Visual Learning Suggestion

A 2D scatter plot (after PCA) showing words like "king", "queen", "man", "woman" clustered in meaningful ways. The famous "king minus man plus woman = queen" visual.

Interactive Element

Embed three sentences (yours, your friend's, a random one). Compute pairwise similarity. Check whether the closest pair matches your intuition.

Hands-on Lab

In Colab, embed 5 sentences of your choice. Print the dimensions. Compute cosine similarity for all pairs. Visualize as a heatmap.

Mini Exercise

Why are embeddings called "dense" representations as opposed to "sparse" ones like keyword vectors?

Common Mistakes

Treating embeddings as universal, fixed features (they are learned, and every model produces its own)
Mixing embeddings from different models (different coordinate systems, incomparable)
Storing embeddings without metadata
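
Why mixing models fails can be sketched with a simplified stand-in. Assume "model B" is just "model A" with its axes rotated. Real models differ by far more than a rotation (different training, often different dimensions), so treat this as the gentlest possible case, not a description of any actual model.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
doc_a, doc_b = rng.normal(size=8), rng.normal(size=8)  # two documents in "model A" space

# Stand-in for "model B": the same geometry, rotated into different axes.
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))    # a random orthogonal matrix
doc_a_model_b, doc_b_model_b = rotation @ doc_a, rotation @ doc_b

within_a = cosine(doc_a, doc_b)
within_b = cosine(doc_a_model_b, doc_b_model_b)
across = cosine(doc_a, doc_a_model_b)  # the same document, embedded by two "models"

print(round(within_a, 4), round(within_b, 4))  # equal: each space is self-consistent
print(round(across, 4))                        # not 1.0, although it is the same document
```

Within each space the similarity is identical, so either "model" works on its own. Across spaces, even a document compared with itself no longer scores as identical.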

Debugging Tips

If similarity scores are weirdly uniform, your text inputs may be too short or too generic. Embeddings tend to work best on substantive content (as a rough guide, more than about 50 words).

Knowledge Check Questions

What is an embedding?
Why do similar texts have nearby vectors?
Why can you not compare embeddings from different models?

Quiz Questions

An embedding from text-embedding-3-small is a list of: a) 100 floats b) 768 floats c) 1,536 floats d) 4,096 floats Answer: c

Challenge Task

Embed 100 product titles from a CSV. Cluster them with KMeans (sklearn). Print 3 examples from each cluster.

Real-world Use Cases

Semantic search ("find similar")
RAG retrieval
Deduplication
Classification with nearest-centroid
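
Nearest-centroid classification, the last item above, can be sketched as follows. The 3D vectors and the label names are made up to stand in for real embeddings, and `fit_centroids` and `predict` are illustrative helper names, not part of any library.

```python
import numpy as np

def fit_centroids(vectors, labels):
    """Average the vectors of each label into one centroid per class."""
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    return {str(label): vectors[labels == label].mean(axis=0) for label in np.unique(labels)}

def predict(vector, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    v = np.asarray(vector, dtype=float)

    def cos(c):
        return float(v @ c / (np.linalg.norm(v) * np.linalg.norm(c)))

    return max(centroids, key=lambda label: cos(centroids[label]))

# Made-up 3D vectors standing in for real embeddings of labeled support tickets.
train = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]
labels = ["billing", "billing", "shipping", "shipping"]

centroids = fit_centroids(train, labels)
print(predict([0.7, 0.3, 0.0], centroids))  # billing
```

With real data you would embed each labeled example once, store only the centroids, and classify new text with one embedding call plus a handful of dot products.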

Industry Insight

A 2026 senior tip: a good embedding plus a small classifier can outperform a giant LLM on many narrow classification problems, often at a small fraction of the cost.

Interview Questions

What is a vector embedding?
Why are embeddings model-specific?
How would you cluster product reviews?

Summary

Embeddings turn text into meaning-aware coordinates. They are the foundation of every retrieval, similarity, and clustering feature in modern AI.

Lesson 8.2: Cosine similarity in 60 seconds

Hook / Why This Matters

Cosine similarity is the one math concept of this module. It is straightforward, and once you see it, every "vector database" makes sense.

Beginner Analogy

If two arrows point in the same direction, they are similar. If they point opposite ways, they are dissimilar. Cosine similarity is the math of "are these arrows pointing the same way?".

Concept Explanation

Cosine similarity = the cosine of the angle between two vectors. Range: -1 (opposite) to 1 (identical direction). For embeddings, it usually lies between 0 and 1.

Formula: cos(theta) = (a . b) / (|a| * |b|). Implementations are one line in any library.

Technical Breakdown

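import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))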

OpenAI's embeddings are normalized (length 1), so cosine simplifies to just the dot product.
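
A small worked example of that simplification, using two hand-picked 2D vectors that are deliberately not unit length. The `normalize` helper is an illustrative name.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def normalize(v):
    """Scale a vector to length 1 without changing its direction."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

a = np.array([3.0, 4.0])  # length 5
b = np.array([4.0, 3.0])  # length 5

print(cosine(a, b))                               # 24 / (5 * 5) = 0.96
print(float(np.dot(a, b)))                        # 24.0: the raw dot product is not a cosine
print(float(np.dot(normalize(a), normalize(b))))  # about 0.96 once both have length 1
```

If your vectors are already unit length, you can skip the division entirely and use the dot product.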

Visual Learning Suggestion

A 2D arrows diagram: two arrows close together (high cosine, ~0.95), two arrows perpendicular (cosine 0), two opposite (-1).

Interactive Element

Compute cosine between "I love pizza" and "I enjoy pasta" (high). Between "I love pizza" and "Quantum physics" (low).

Hands-on Lab

Build a tiny semantic search: 10 sentences embedded, accept a query, return the top-3 by cosine.

Mini Exercise

Why is cosine preferred over Euclidean distance for high-dimensional embeddings?

Common Mistakes

Forgetting to normalize (only matters if your provider returns unnormalized vectors)
Confusing cosine with dot product when vectors are not unit length
Treating a fixed cutoff such as 0.7 as the line for "very similar" (rarely useful; the right value depends on the model)

Debugging Tips

If similarity scores cluster in a narrow band (e.g., 0.7 to 0.9), that may be normal for your embedding model. Calibrate to your data; do not rely on absolute thresholds.
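
One possible way to calibrate: take a sample of your own embeddings, look at the distribution of pairwise scores, and set the cutoff at a percentile. The random sample below is only a stand-in for real embeddings, and the 95th percentile is an arbitrary starting point, not a recommendation.

```python
import numpy as np

def pairwise_cosine(vectors):
    """Cosine similarity for every pair of rows, as an (n, n) matrix."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

def calibrated_threshold(vectors, percentile=95):
    """Score that only the top (100 - percentile)% of distinct pairs exceed."""
    sims = pairwise_cosine(np.asarray(vectors, dtype=float))
    upper = sims[np.triu_indices(len(sims), k=1)]  # each pair once, no self-pairs
    return float(np.percentile(upper, percentile))

# Random stand-in for a sample of your own embeddings.
rng = np.random.default_rng(42)
sample = rng.normal(loc=1.0, size=(200, 16))  # shifted mean so scores bunch together

threshold = calibrated_threshold(sample)
print(round(threshold, 3))
```

Pairs scoring above the returned value are unusually similar for this data and this model, whatever the absolute number turns out to be.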

Knowledge Check Questions

What is cosine similarity?
What range does it produce?
Why is it preferred for embeddings?

Quiz Questions

Two identical embeddings have cosine similarity: a) 0 b) 0.5 c) 1 d) -1 Answer: c

Challenge Task

Run your top-3 semantic search on 1,000 sentences. Time it. Compare a naive Python loop with a NumPy batched implementation.
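
As a starting point, the batched side might look like the sketch below. Random vectors stand in for real sentence embeddings, and `top_k` is an illustrative helper name. Writing the loop version and doing the timing are left to you.

```python
import numpy as np

def top_k(query, matrix, k=3):
    """Indices and scores of the k rows of `matrix` most similar to `query`."""
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    scores = unit_rows @ unit_query      # all cosine similarities in one matrix-vector product
    best = np.argsort(scores)[::-1][:k]  # highest scores first
    return best, scores[best]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64))              # stand-in for 1,000 sentence embeddings
query = corpus[123] + 0.01 * rng.normal(size=64)  # a slightly noisy copy of row 123

indices, scores = top_k(query, corpus)
print(indices[0])  # 123
```

The whole search is one matrix-vector product, which is why it beats a Python loop over 1,000 separate `cosine` calls.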

Real-world Use Cases

Semantic search ranking
Duplicate detection
Clustering and recommendation

Industry Insight

When you scale to millions of vectors, naive cosine is too slow. Vector DBs use Approximate Nearest Neighbor (ANN) indexes (HNSW, IVF) to keep retrieval fast.

Interview Questions

Define cosine similarity.
Why is it preferred for embeddings?
What is ANN search and why is it needed?

Summary

Cosine similarity = direction agreement. It is the universal metric for "find similar" with embeddings.

Lesson 8.3: Embedding models in 2026: OpenAI, Voyage, Cohere, open source

Hook / Why This Matters

A better embedding model can lift RAG quality more than a better LLM. Knowing your options is a quiet superpower.

Beginner Analogy

Choosing an embedding model is like choosing a camera for a photographer. Quality, cost, latency, and specialization vary widely. You match it to the job.

Concept Explanation

Top 2026 embedding model families:

OpenAI (text-embedding-3-small, text-embedding-3-large): well-balanced, well-supported, cheap.
Voyage AI (voyage-3, voyage-3-large): often top of leaderboards, great for English and code.
Cohere (embed-v3): multilingual strength, hybrid search friendly.
Open source (bge, nomic-embed, mxbai-embed-large): self-hostable, privacy-friendly.
Specialized: code embeddings (Voyage code), legal, biomedical.

Pick based on language coverage, domain, deployment constraints, and cost.

Technical Breakdown

Compare two providers on the same data:

from sentence_transformers import SentenceTransformer
bge = SentenceTransformer("BAAI/bge-base-en-v1.5")
bge_vec = bge.encode("hello world")

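Compare this with the OpenAI snippet from Lesson 8.1. The same text lands in a different coordinate system (768 dimensions here versus 1,536 there), so the raw vectors are not comparable across models. What you can compare is downstream search quality.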

Visual Learning Suggestion

A 4-quadrant chart: x-axis cost (low to high), y-axis quality (low to high), with named models placed.

Interactive Element

Read the latest MTEB (Massive Text Embedding Benchmark) leaderboard. Note the top 3 today. Note their license, cost, and language support.

Hands-on Lab

Embed the same 20 sentences with two different models. Run nearest-neighbor for one query. Compare top-3 results.

Mini Exercise

When would you choose an open-source embedding model over a hosted one?

Common Mistakes

Mixing models (incompatible vector spaces)
Choosing by leaderboard alone (your data may behave differently)
Underestimating cost at scale (a few cents per 1K embeddings adds up at 10M chunks)
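
A back-of-the-envelope estimate helps with the last point. The sketch assumes per-token billing, and the chunk size and price are placeholder numbers, not a quote for any provider. Look up the current price for your model.

```python
def embedding_cost_usd(num_chunks, avg_tokens_per_chunk, usd_per_million_tokens):
    """Rough one-off cost of embedding a corpus, assuming per-token billing."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Illustrative numbers only.
cost = embedding_cost_usd(
    num_chunks=10_000_000,
    avg_tokens_per_chunk=400,
    usd_per_million_tokens=0.10,
)
print(f"${cost:,.2f}")  # $400.00
```

Remember that switching models later means paying this again, because every chunk must be re-embedded.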

Debugging Tips

If retrieval quality is poor, try a different embedding model before tuning anything else. It is often a one-line code change with a big lift, though you must re-embed your documents with the new model.

Knowledge Check Questions

Name two hosted and two open-source embedding model families.
What is MTEB?
When is multilingual support a hard requirement?

Quiz Questions

For a multilingual customer support RAG, you would prefer: a) OpenAI text-embedding-3-small b) Cohere embed-multilingual-v3 c) An English-only BGE model d) Voyage voyage-code-3 Answer: b

Challenge Task

Build a "model comparison" notebook: same 30 sentences, two queries, three embedding models. Score top-3 hits and tabulate.

Real-world Use Cases

RAG (foundational)
Semantic product search
Multilingual content matching
Code search

Industry Insight

Production teams now often build "embedding A/B tests" before committing to a model. The cost is days. The payoff is years of retrieval quality.
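
A minimal sketch of such an A/B test: a small set of queries with known correct documents, scored per model. The two "models" below are toy functions so the example runs offline. In practice you would pass in real embedding functions, and `hit_rate_at_k` is an illustrative name.

```python
import numpy as np

def hit_rate_at_k(embed, documents, labeled_queries, k=3):
    """Fraction of queries whose expected document index appears in the top k."""
    doc_vecs = np.array([embed(d) for d in documents], dtype=float)
    doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    hits = 0
    for query, expected_index in labeled_queries:
        q = np.asarray(embed(query), dtype=float)
        scores = doc_vecs @ (q / np.linalg.norm(q))
        hits += int(expected_index in np.argsort(scores)[::-1][:k])
    return hits / len(labeled_queries)

# Toy stand-ins for real models, so the sketch runs without network access.
VOCAB = ["refund", "invoice", "shipping", "delay", "password", "login"]

def toy_model_a(text):
    """Counts vocabulary words: crude, but related texts get similar vectors."""
    words = text.lower().split()
    return [words.count(w) + 1e-6 for w in VOCAB]

def toy_model_b(text):
    """Ignores meaning entirely and encodes only the text length."""
    return [len(text), 1.0]

docs = ["refund for a wrong invoice", "shipping delay on my order", "password reset for login"]
queries = [("invoice refund", 0), ("login password", 2), ("delay in shipping", 1)]

for name, model in [("A", toy_model_a), ("B", toy_model_b)]:
    print(name, hit_rate_at_k(model, docs, queries, k=1))
```

The labeled queries are the valuable part. Once you have them, testing another model is one more entry in the loop.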

Interview Questions

How do you pick an embedding model?
What is MTEB?
Why might open-source beat hosted for your use case?

Summary

Embedding models vary in quality, cost, language, and license. Pick deliberately, benchmark on your data, and never mix.

Lesson 8.4: Vector databases: when to use what

Hook / Why This Matters

The first vector DB you pick will probably be wrong. The second will be okay. This lesson saves you from picking the wrong third.

Beginner Analogy

Vector DBs are like databases for "find similar" instead of "find exact". Different ones suit different scales and stacks.

Concept Explanation

Top 2026 vector DB choices:

ChromaDB: simplest local dev. Great for prototypes and modest production loads.
FAISS: Facebook's in-memory library. Fast, library-only (you wrap it).
pgvector: PostgreSQL extension. Best if you already use Postgres (Supabase).
Qdrant: production-grade open source. Self-host or cloud.
Weaviate: production-grade open source. Native hybrid search.
Pinecone: hosted, easy, paid.
Milvus: massive-scale open source.

Pick by deployment style (local vs cloud), scale (1K vs 1B vectors), and stack (already on Postgres? use pgvector).

Technical Breakdown

Local Chroma quickstart:

import chromadb

client = chromadb.Client()  # in-memory client; data is lost when the process exits
col = client.get_or_create_collection("docs")  # safe to re-run, unlike create_collection
col.add(ids=["1"], documents=["GeekHub is for developers"], metadatas=[{"src": "site"}])
res = col.query(query_texts=["who is GeekHub for?"], n_results=1)
print(res)

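That is the retrieval half of RAG in six lines. Because no embeddings are passed in, Chroma embeds the text for you with its default embedding function.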

Visual Learning Suggestion

A "decision tree" diagram: start at the top with "prototype or production?" then branch to "local or cloud?" then to "existing stack?", landing at a recommended vector DB.

Interactive Element

Install ChromaDB. Run the snippet. Confirm retrieval works.

Hands-on Lab

Build a 50-document Chroma index of your favorite blog posts. Query it. Note retrieval quality.

Mini Exercise

Why is pgvector great for teams already on Postgres or Supabase?

Common Mistakes

Choosing the hottest DB without considering your team's stack
Over-engineering with Pinecone for a prototype
Skipping metadata fields (you cannot filter retrieval later)

Debugging Tips

If retrieval is slow at scale, you may be using exact search. Switch to an ANN index (HNSW).

Knowledge Check Questions

Name 5 vector DB options.
When would you pick pgvector?
What is HNSW?

Quiz Questions

For a Supabase-based app, the natural vector DB choice is: a) Pinecone b) pgvector c) ChromaDB d) Milvus Answer: b

Challenge Task

Build the same 50-document index in two DBs (Chroma and pgvector). Compare ergonomics, speed, and cost.

Real-world Use Cases

All RAG apps
Semantic search
Recommendation systems
Deduplication services

Industry Insight

In 2026 the "best vector DB" depends entirely on your stack. There is no global winner. Engineers who can switch DBs in a week stay valuable.

Interview Questions

How do you choose a vector DB?
What is the difference between exact and approximate nearest neighbor search?
What metadata would you store alongside vectors?

Summary

Seven solid options. Match to your stack, scale, and deployment style. Most beginners should start with ChromaDB or pgvector.

Lesson 8.5: Your first embedding query in Python

Hook / Why This Matters

This is the lesson where you make embeddings real. By the end you have a working semantic search over your own data. Module 9 will turn this into RAG.

Beginner Analogy

Today you build a tiny private Google for whatever text you give it. Cool.

Concept Explanation

Three steps:

Embed your documents.
Store the vectors with their text (and ideally metadata).
At query time, embed the question and find nearest vectors.

We will use OpenAI embeddings + ChromaDB. Swap providers later as needed.

Technical Breakdown

import glob
import chromadb
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()  # loads OPENAI_API_KEY from a local .env file

client_oai = OpenAI()
client_chroma = chromadb.PersistentClient(path="./chroma_db")  # persists to disk
col = client_chroma.get_or_create_collection("notes")

def embed(text):
    """Return the embedding vector for one piece of text."""
    return client_oai.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

# Ingest .txt files from ./notes
for path in glob.glob("notes/*.txt"):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # upsert (rather than add) so re-running the script does not duplicate documents
    col.upsert(
        ids=[path],
        documents=[text],
        embeddings=[embed(text)],
        metadatas=[{"source": path}],
    )

# Query: embed the question with the same model, then fetch the 3 nearest documents
question = "what does GeekHub do?"
results = col.query(query_embeddings=[embed(question)], n_results=3)
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print(meta["source"], "->", doc[:200])

Roughly 30 lines and you have a semantic search engine.

Visual Learning Suggestion

A flow diagram of the script: files in -> embed -> Chroma -> query -> top results out.

Interactive Element

Run the script on your own notes folder. Marvel at retrieval.

Hands-on Lab

Build the script. Add at least 10 documents. Run 5 queries. Save results for Module 9.

Mini Exercise

Why is this not yet RAG?

Common Mistakes

Forgetting to persist Chroma (use PersistentClient)
Embedding the same document twice (deduplicate)
Not storing metadata (you will want filters later)
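
One simple way to deduplicate is to derive each document's ID from its content, so identical text always maps to the same ID. This is a sketch of that idea with a plain dictionary standing in for the vector store. `content_id` and `ingest` are illustrative names.

```python
import hashlib

def content_id(text):
    """Stable ID derived from the text itself, ignoring surrounding whitespace."""
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()

seen = {}  # stand-in for the vector store: id -> text

def ingest(text):
    """Store the text once; return False if identical content is already present."""
    doc_id = content_id(text)
    if doc_id in seen:
        return False  # skip: no second embedding call, no duplicate entry
    seen[doc_id] = text
    return True

print(ingest("GeekHub is for developers"))    # True
print(ingest("GeekHub is for developers\n"))  # False
```

This catches exact duplicates only. Near-duplicates (a fixed typo, a changed date) need a similarity check on the embeddings themselves.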

Debugging Tips

If results are weak, your documents may be too long for one embedding. Chunk them (we will do this in Module 9).
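
As a preview, a very simple word-based chunker might look like this. The chunk size and overlap are arbitrary defaults for illustration, and Module 9 may take a different approach.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into chunks of up to `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(450))  # a fake 450-word document
chunks = chunk_words(sample)
print(len(chunks))                       # 3
print([len(c.split()) for c in chunks])  # [200, 200, 90]
```

Each chunk then gets its own embedding and its own row in the vector store, ideally with metadata pointing back to the source file.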

Knowledge Check Questions

What is the difference between semantic search and RAG?
Why store metadata?
What is persistence for vector DBs?

Quiz Questions

The script above is missing what to make it RAG? a) An LLM call that uses the retrieved chunks b) A bigger model c) More documents d) Tokenization Answer: a

Challenge Task

Add a date filter so you can search "notes from the last 30 days only".

Real-world Use Cases

Personal note search
Internal wiki search
Product catalog search
Code snippet retrieval

Industry Insight

This short script is the production seed of dozens of "find similar" features in real products. Master it.

Interview Questions

Walk through building a semantic search by hand, step by step.
How would you add hybrid (keyword + vector) search?
What metadata is most valuable to store?

Summary

Embed, store, query. Three steps to your first semantic search. Module 9 adds the LLM and you have RAG.

Module 8 Recap

Embeddings, cosine similarity, model choices, vector DBs, and your first semantic search are now in your toolkit. You are ready to build a real RAG system.

Next Module

Module 9: Building a PDF Chatbot