Lesson 8.1: What is an embedding, really? | GeekHub Learn

Embeddings are the unsexy plumbing of AI. They run quietly behind ChatGPT, Google Search, Spotify recommendations, and every "find similar" feature you have ever used.

Imagine assigning every word, sentence, or document an address on a giant map. Things with similar meaning live close together. Embeddings are the addresses.

A vector embedding is a list of numbers (e.g., 1,536 floats) that represents the meaning of a piece of text. Texts with similar meaning have nearby vectors; texts with unrelated meaning have vectors that are far apart.

Embeddings are produced by neural networks trained on huge corpora to project text into a meaning-aware coordinate system.

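from openai import OpenAI
client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable
# Request one embedding and pull the vector out of the response
response = client.embeddings.create(model="text-embedding-3-small", input="hello")
emb = response.data[0].embedding
print(len(emb), emb[:5])  # dimension count, then the first five values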

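With the default settings this model returns 1,536 floats. To compare two embeddings, use cosine similarity: the cosine of the angle between the two vectors, where values near 1 mean the texts are close in meaning.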
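Cosine similarity itself fits in a few lines. The sketch below uses only the standard library and toy 3-dimensional vectors invented for illustration; the function name `cosine_similarity` is our own, not part of any SDK.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration (real embeddings have far more dimensions).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(round(same_direction, 6), round(orthogonal, 6))
```

Vectors pointing the same way score 1 regardless of length, and perpendicular vectors score 0, which is why the measure suits comparing meaning rather than magnitude.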

Visualize it

A 2D scatter plot (after PCA) showing words like "king", "queen", "man", "woman" clustered in meaningful ways. The famous "king minus man plus woman = queen" visual.
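The arithmetic behind that visual can be tried on hand-made vectors. The sketch below assumes a made-up 2-D space (axis 0 for royalty, axis 1 for gender) in which the analogy holds exactly by construction; in a real model the relationship is only approximate.

```python
import math

# Hand-made 2-D "embeddings" (axis 0 = royalty, axis 1 = gender); NOT real model output.
toy = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, computed coordinate by coordinate
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Nearest remaining word by cosine similarity (the three input words are excluded)
candidates = {w: v for w, v in toy.items() if w not in {"king", "man", "woman"}}
nearest = max(candidates, key=lambda w: cosine(candidates[w], target))
print(target, nearest)
```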

Try it now

Embed three sentences (yours, your friend's, and a random one). Compute the pairwise similarities. Check whether the closest pair matches your intuition.

Hands-on lab

In Colab, embed 5 sentences of your choice. Print the dimensions of each vector. Compute the cosine similarity for all pairs. Visualize the result as a heatmap.
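As a starting point for the lab, here is a minimal sketch of the all-pairs step. The three vectors and their labels are invented placeholders standing in for real sentence embeddings, and `similarity_matrix` is a hypothetical helper name.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def similarity_matrix(vectors):
    """Return an n x n list of lists with the cosine similarity of every pair."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Placeholder vectors standing in for sentence embeddings (invented values).
labels = ["cats", "kittens", "taxes"]
vectors = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]]

matrix = similarity_matrix(vectors)
for label, row in zip(labels, matrix):
    # One text row per sentence: a crude heatmap made of numbers
    print(f"{label:>8}", " ".join(f"{score:5.2f}" for score in row))
```

The diagonal is always 1 (each vector compared with itself) and the matrix is symmetric. If matplotlib is available in your notebook, passing `matrix` to `plt.imshow` is one way to turn it into a colored heatmap.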

Try it now

Why are embeddings called "dense" representations as opposed to "sparse" ones like keyword vectors?
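One way to see the contrast before answering: build a keyword vector by hand and count its zeros. The vocabulary and the four "dense" numbers below are invented for illustration; a real embedding would come from a model.

```python
# Sparse keyword vector: one slot per vocabulary word, mostly zeros.
vocabulary = ["apple", "bank", "cat", "dog", "loan", "money", "river", "tree"]

def keyword_vector(text):
    """Count how often each vocabulary word appears in the text."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

sparse = keyword_vector("the bank approved the loan")

# Dense vector: a short list where every slot carries a nonzero value.
# These four numbers are invented; a real model would produce them.
dense = [0.12, -0.48, 0.33, 0.07]

zeros_in_sparse = sparse.count(0)
zeros_in_dense = dense.count(0)
print(sparse, zeros_in_sparse, zeros_in_dense)
```

With a realistic vocabulary of tens of thousands of words, the keyword vector would be almost entirely zeros, while the dense vector stays short and fully populated.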

Common mistakes

Treating embeddings as fixed, universal features (they are learned and model-dependent)
Mixing embeddings from different models (different coordinate systems, so the vectors are not comparable)
Storing embeddings without metadata (such as the source text and the model that produced them)
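The last two mistakes can be guarded against with a small record type. This is a minimal sketch, not a prescribed schema: `EmbeddingRecord` and `assert_comparable` are hypothetical names, the vectors are invented placeholders, and "some-other-model" is a made-up model name.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingRecord:
    text: str      # the original input, so the vector can be traced back
    model: str     # which model produced the vector
    vector: tuple  # the embedding itself

def assert_comparable(a, b):
    """Raise if two records come from different models or have different sizes."""
    if a.model != b.model:
        raise ValueError(f"cannot compare {a.model!r} with {b.model!r}")
    if len(a.vector) != len(b.vector):
        raise ValueError("vectors have different dimensions")

# Invented placeholder vectors; "some-other-model" is a made-up name.
r1 = EmbeddingRecord("hello", "text-embedding-3-small", (0.1, 0.2, 0.3))
r2 = EmbeddingRecord("hi there", "text-embedding-3-small", (0.1, 0.25, 0.3))
r3 = EmbeddingRecord("hello", "some-other-model", (0.5, 0.5, 0.5))

assert_comparable(r1, r2)  # same model and dimensions: passes silently
try:
    assert_comparable(r1, r3)
    mixed_models_rejected = False
except ValueError:
    mixed_models_rejected = True
print(mixed_models_rejected)
```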

Debugging tip

If similarity scores are weirdly uniform, your text inputs may be too generic or too short. As a rule of thumb, embeddings work best on substantive content (more than about 50 words).

Challenge

Embed 100 product titles from a CSV. Cluster them with KMeans (sklearn). Print 3 examples from each cluster.
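The challenge calls for sklearn's KMeans. To show what the clustering step does without any dependencies, here is a hand-rolled stand-in (plain Lloyd's algorithm, seeded with the first k points), not sklearn's implementation. The titles and 2-D vectors are invented; real vectors would come from the embedding API.

```python
import math

def kmeans(points, k, iterations=20):
    """Tiny Lloyd's-algorithm k-means; seeds centroids with the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments

# Invented titles and 2-D vectors, for illustration only.
titles = ["red running shoes", "espresso machine", "blue trail sneakers", "coffee grinder"]
vectors = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]

labels = kmeans(vectors, k=2)
for cluster in range(2):
    # Print up to 3 example titles per cluster, as the challenge asks
    print(cluster, [t for t, l in zip(titles, labels) if l == cluster][:3])
```

In the actual challenge, `KMeans(n_clusters=...).fit_predict(vectors)` from `sklearn.cluster` plays the role of the `kmeans` function here and returns the same kind of label list.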

Where this shows up

Semantic search ("find similar")
RAG retrieval
Deduplication
Classification with nearest-centroid
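The last item, nearest-centroid classification, is simple enough to sketch in full: average the embeddings of each class, then assign a new vector to the class with the closest average. The labels and 2-D vectors below are invented stand-ins for real embeddings, and the helper names are our own.

```python
import math

def centroid(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def classify(vector, centroids):
    """Return the label whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))

# Invented 2-D vectors standing in for embeddings of labeled examples.
training = {
    "billing": [[0.9, 0.1], [0.8, 0.2]],
    "shipping": [[0.1, 0.9], [0.2, 0.7]],
}
centroids = {label: centroid(vecs) for label, vecs in training.items()}

prediction = classify([0.7, 0.3], centroids)
print(prediction)
```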

From the field

A 2026 senior tip: a good embedding plus a small classifier can outperform a giant LLM on many narrow classification problems, at roughly 100x lower cost.

Recap

Embeddings turn text into meaning-aware coordinates. They are the foundation of retrieval, similarity, and clustering features in modern AI.

Quick recall

What is an embedding?

Quiz time

1. An embedding from `text-embedding-3-small` is a list of