Lesson 8.2: Cosine similarity in 60 seconds | GeekHub Learn

Cosine similarity is the one math concept of this module. It is straightforward, and once you see it, every "vector database" makes sense.

If two arrows point in the same direction, they are similar. If they point opposite ways, they are dissimilar. Cosine similarity is the math of "are these arrows pointing the same way?".

Cosine similarity = the cosine of the angle between two vectors. Range: -1 (opposite) to 1 (identical direction). For embeddings, it usually lies between 0 and 1.

Formula: cos(theta) = (a . b) / (|a| * |b|). Implementations are one line in any library.

import numpy as np
def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))

OpenAI's embeddings are normalized (length 1), so cosine simplifies to just the dot product.

Visualize it

A 2D arrows diagram: two arrows close together (high cosine, ~0.95), two arrows perpendicular (cosine 0), two opposite (-1).

Try it now

Compute cosine between "I love pizza" and "I enjoy pasta" (high). Between "I love pizza" and "Quantum physics" (low).

Hands-on lab

Build a tiny semantic search: 10 sentences embedded, accept a query, return the top-3 by cosine.

Try it now

Why is cosine preferred over Euclidean distance for high-dimensional embeddings?

Common mistakes

Forgetting to normalize (only matters if your provider returns unnormalized vectors)
Confusing cosine with dot product when vectors are not unit length
Using similarity below 0.7 as "very similar" (rarely useful; depends on model)

Debugging tip

If similarity scores cluster in a narrow band (e.g., 0.7 to 0.9), that may be normal for your embedding model. Calibrate to your data, do not use absolute thresholds.

Challenge

Run your top-3 semantic search on 1,000 sentences. Time it. Compare to a naive Python loop vs a NumPy batched implementation.

Where this shows up

Semantic search ranking
Duplicate detection
Clustering and recommendation

From the field

When you scale to millions of vectors, naive cosine is too slow. Vector DBs use Approximate Nearest Neighbor (ANN) indexes (HNSW, IVF) to keep retrieval fast.

Recap

Cosine similarity = direction agreement. It is the universal metric for "find similar" with embeddings.

Quick recall

3 prompts · think before you flip

Prompt 1 of 3

What is cosine similarity?

Quiz time

1 question · tap an answer to check it

1. Two identical embeddings have cosine similarity