GeekHub Learn
Module
Lesson 8.22 of 5 in this module2 min read Module 8: Vector Embeddings Simplified

Cosine similarity in 60 seconds

Cosine similarity is the one math concept of this module. It is straightforward, and once you see it, every "vector database" makes sense.

If two arrows point in the same direction, they are similar. If they point opposite ways, they are dissimilar. Cosine similarity is the math of "are these arrows pointing the same way?".

Cosine similarity = the cosine of the angle between two vectors. Range: -1 (opposite) to 1 (identical direction). For embeddings, it usually lies between 0 and 1.

Formula: cos(theta) = (a . b) / (|a| * |b|). Implementations are one line in any library.

import numpy as np
def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))

OpenAI's embeddings are normalized (length 1), so cosine simplifies to just the dot product.

Visualize it

A 2D arrows diagram: two arrows close together (high cosine, ~0.95), two arrows perpendicular (cosine 0), two opposite (-1).

Try it now

Compute cosine between "I love pizza" and "I enjoy pasta" (high). Between "I love pizza" and "Quantum physics" (low).

Hands-on lab

Build a tiny semantic search: 10 sentences embedded, accept a query, return the top-3 by cosine.

Try it now

Why is cosine preferred over Euclidean distance for high-dimensional embeddings?

Common mistakes

  • Forgetting to normalize (only matters if your provider returns unnormalized vectors)
  • Confusing cosine with dot product when vectors are not unit length
  • Using similarity below 0.7 as "very similar" (rarely useful; depends on model)

Debugging tip

If similarity scores cluster in a narrow band (e.g., 0.7 to 0.9), that may be normal for your embedding model. Calibrate to your data, do not use absolute thresholds.

Challenge

Run your top-3 semantic search on 1,000 sentences. Time it. Compare to a naive Python loop vs a NumPy batched implementation.

Where this shows up

  • Semantic search ranking
  • Duplicate detection
  • Clustering and recommendation

From the field

When you scale to millions of vectors, naive cosine is too slow. Vector DBs use Approximate Nearest Neighbor (ANN) indexes (HNSW, IVF) to keep retrieval fast.

Recap

Cosine similarity = direction agreement. It is the universal metric for "find similar" with embeddings.


Quick recall

3 prompts · think before you flip

Prompt 1 of 3

What is cosine similarity?

Quiz time

1 question · tap an answer to check it

  1. 1. Two identical embeddings have cosine similarity

Finished lesson 8.2?

Mark complete to update your module progress and unlock the streak.

Loading