Module 7: Introduction to RAG (Retrieval-Augmented Generation)

Module Goal

Build a complete mental model of RAG: what it is, why it exists, its pipeline, its costs, and its failure modes. By the end you can architect a RAG system on a whiteboard, even before writing code.

Estimated Duration

3 to 4 hours.

Skills Learned

Defining RAG and the problem it solves
Drawing the 5-step RAG pipeline from memory
Picking RAG over fine-tuning correctly
Spotting RAG failure modes
Sketching a production RAG architecture

Real-world Importance

RAG is the dominant 2026 pattern for adding private, fresh, or proprietary knowledge to LLMs. Most "Chat with your X" products are built on it. Knowing RAG inside out is non-negotiable for an AI engineering career.

Lessons in this module

The problem RAG solves
The 5-step RAG pipeline
RAG vs fine-tuning vs long context
Where RAG shines, and where it fails
Sketching a production RAG architecture

Lesson 7.1: The problem RAG solves

Hook / Why This Matters

Ask ChatGPT about your company's internal handbook. It will guess. Confidently. Wrongly. RAG is the fix. This is the lesson where that clicks.

Beginner Analogy

A brilliant new employee on day one knows the world but not your company. Hand them the employee handbook before each meeting and they will sound like a 5-year veteran. RAG hands the handbook to the LLM right before each answer.

Concept Explanation

LLMs have two knowledge limits:

Cutoff: training data has a date. Anything after that does not exist for the model.
Privacy: the model was not trained on your internal docs, your PDFs, your DB.

RAG addresses both by retrieving relevant snippets from a knowledge source at inference time and adding them to the prompt. The model still does the answering; the knowledge simply arrives just in time.

Technical Breakdown

A naive prompt: "What is GeekHub's reputation algorithm?" -> LLM guesses.

A RAG prompt: "Use the following snippets from GeekHub's docs to answer. [snippet1] [snippet2] Question: What is GeekHub's reputation algorithm?" -> LLM grounds its answer in the snippets.

The retrieval step picks the most relevant snippets. The augmentation step packs them into the prompt. The generation step answers.
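
The augmentation step can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed template: the function name build_rag_prompt, the numbering of snippets, and the extra "say so" instruction are assumptions added here, and the snippets are placeholders for whatever the retrieval step returns.

```python
def build_rag_prompt(question: str, snippets: list[str]) -> str:
    """Augmentation step: pack retrieved snippets and the question into one prompt."""
    # Number the snippets so the model (and a human reader) can refer to them.
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets, start=1))
    return (
        "Use the following snippets from GeekHub's docs to answer. "
        # Assumption: this fallback instruction is an addition, not part of the lesson's prompt.
        "If the snippets do not contain the answer, say so.\n\n"
        f"{numbered}\n\n"
        f"Question: {question}"
    )


# Placeholder snippets; in a real system these come from the retrieval step.
snippets = ["<snippet1>", "<snippet2>"]
prompt = build_rag_prompt("What is GeekHub's reputation algorithm?", snippets)
print(prompt)
```

The resulting string is what gets sent to the LLM in the generation step.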

Visual Learning Suggestion

A side-by-side diagram: "Without RAG" arrow from question to hallucinated answer; "With RAG" arrow goes through a "Retrieve" box, then "Augment", then "Generate", landing on a grounded answer.

Interactive Element

Ask ChatGPT a question about a niche topic only documented on one website. Note whether it hedges or guesses. Now copy a paragraph from that website into the prompt and ask again. Watch confidence and accuracy rise.

Hands-on Lab

Take any 200-word paragraph from a website. Write a question whose answer is only in that text. Without the paragraph, ask the LLM. With the paragraph (pasted into the prompt), ask again. Document the difference.

Mini Exercise

Why is RAG better than waiting for the next model retrain to absorb your docs?

Common Mistakes

Calling everything that augments a prompt "RAG" (RAG specifically uses retrieval)
Confusing RAG with fine-tuning (different mechanism, different cost shape)
Believing larger context windows make RAG obsolete (they reduce, not eliminate, the need)

Debugging Tips

If your RAG answers hallucinate, the retrieval probably failed: wrong chunk, missing chunk, irrelevant chunk. The model is rarely the culprit.

Knowledge Check Questions

What two LLM limits does RAG solve?
What does RAG add to a prompt?
When is retrieval the wrong fix?

Quiz Questions

RAG augments prompts with: a) Random chunks b) Retrieved snippets relevant to the question c) The full database d) Nothing Answer: b

Challenge Task

Write a one-page explanation of RAG for a non-engineer manager who is deciding whether to fund a RAG project.

Real-world Use Cases

Internal company Q&A bots
Legal and medical document assistants
Code repo "ask the codebase" tools
Customer support backed by product docs

Industry Insight

In 2026 over 70% of enterprise AI projects are some flavor of RAG. The skill is widely demanded, often poorly executed, and a clear differentiator.

Interview Questions

What problem does RAG solve?
Why not just fine-tune?
What signals that a RAG project will fail?

Summary

RAG injects relevant outside knowledge into the prompt at query time. It solves the knowledge cutoff and the privacy gap that pure LLMs cannot fix on their own.

Lesson 7.2: The 5-step RAG pipeline

Hook / Why This Matters

If you can recite the 5 steps from memory, you can design, debug, and explain any RAG system in the world. This lesson hands you that mental model.

Beginner Analogy

A library system: catalog the books, write index cards, store the cards, look up by topic, hand the right cards to the reader. Same five steps in RAG.

Concept Explanation

The 5 steps:

Load: pull documents from their source (PDF, web, DB).
Chunk: split documents into small overlapping pieces.
Embed: convert each chunk to a vector.
Store: index the vectors in a vector database.
Retrieve and Generate: embed the user question, find nearest chunks, stuff them into the prompt, generate.

Steps 1 to 4 happen offline (or at ingest time). Step 5 happens per query.
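
The five steps can be walked through end to end with a toy, standard-library-only sketch. Everything here is a stand-in chosen for illustration: the two hard-coded documents, sentence-level chunking without overlap, a bag-of-words "embedding", and an in-memory list in place of a vector database. A real system would use an embedding model and a vector DB, and would send the final prompt to an LLM.

```python
import math
import re
from collections import Counter


def load() -> list[str]:
    # Step 1 (Load): hard-coded stand-in documents instead of PDFs or a DB.
    return [
        "Refunds are issued within 14 days of purchase. Contact support to start a refund.",
        "Our office is closed on public holidays. Remote work is allowed on Fridays.",
    ]


def chunk(doc: str) -> list[str]:
    # Step 2 (Chunk): one sentence per chunk, for brevity (no overlap here).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]


def embed(text: str) -> Counter:
    # Step 3 (Embed): toy bag-of-words vector; real systems use an embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Step 4 (Store): an in-memory list of (vector, chunk) pairs stands in for a vector DB.
store = [(embed(c), c) for doc in load() for c in chunk(doc)]


def retrieve(question: str, k: int = 2) -> list[str]:
    # Step 5a (Retrieve): embed the question and return the k most similar chunks.
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


question = "How many days do refunds take?"
top_chunks = retrieve(question)
# Step 5b (Generate): this prompt would be sent to an LLM; the call is omitted.
prompt = "Context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {question}"
print(prompt)
```

Note that the store is built once, while retrieve runs for every question, which mirrors the offline and per-query split.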

Technical Breakdown

Offline ingest pipeline:

PDFs -> text -> chunks of ~500 tokens with 50-token overlap -> embeddings -> ChromaDB

Online query pipeline:

user query -> embedding -> top-K chunk lookup -> prompt = [system + top chunks + user query] -> LLM -> answer

Default starting values: chunk_size=500 tokens, overlap=50 tokens, top_K=4 to 6 chunks.
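
To make the chunk_size and overlap defaults concrete, here is a small sliding-window sketch. It assumes the document has already been turned into a list of tokens; the function name chunk_tokens is hypothetical, and the generated placeholder tokens stand in for real tokenizer output, which will count differently from words.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each new window starts this many tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Placeholder tokens w0..w1199 stand in for a tokenized 1,200-token document.
tokens = [f"w{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

With these defaults each window advances by 450 tokens, so the last 50 tokens of one chunk reappear as the first 50 of the next. A sentence that straddles a boundary therefore survives intact in at least one chunk.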

Visual Learning Suggestion

A two-row pipeline diagram: top row "Ingest (offline)" with 4 boxes; bottom row "Query (online)" with 4 boxes. Arrows show data flow.

Interactive Element

Take any document. Manually chunk it into 300-word pieces. For a question, by hand pick the 2 chunks you would want the model to see. Notice you just did RAG without code.

Hands-on Lab

On paper, sketch the 5-step pipeline. Label each step with what data flows in and out. Take a photo.

Mini Exercise

Why is chunk overlap useful?

Common Mistakes

Skipping chunking and embedding whole documents (kills retrieval quality)
Chunking too small (each chunk loses context) or too large (relevant text is diluted and the prompt fills with noise)
Forgetting to filter out duplicate or near-identical chunks
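
One simple way to filter near-identical chunks before indexing is to compare word sets. The sketch below uses Jaccard similarity with an assumed threshold of 0.9; both the technique and the threshold are illustrative choices rather than a recommendation from this module, and the pairwise comparison is only practical for small corpora.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Share of words the two sets have in common (1.0 means identical sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def drop_near_duplicates(chunks: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a chunk only if it is not too similar to any chunk already kept."""
    kept: list[str] = []
    kept_sets: list[set[str]] = []
    for c in chunks:
        words = set(c.lower().split())
        if all(jaccard(words, seen) < threshold for seen in kept_sets):
            kept.append(c)
            kept_sets.append(words)
    return kept


# Invented sample chunks; the second differs from the first only in letter case.
sample = ["Refunds take 14 days.", "refunds take 14 days.", "Offices close on holidays."]
unique = drop_near_duplicates(sample)
print(unique)
```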

Debugging Tips

When RAG quality is poor, instrument retrieval first: log which chunks were retrieved for failing questions. Almost always the issue is there.
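
A minimal way to instrument retrieval is to write one structured log line per query. The sketch below uses Python's standard logging and json modules; the logger name, the record fields, and the chunk ids are hypothetical and should be adapted to whatever your retriever returns.

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("rag.retrieval")  # hypothetical logger name


def log_retrieval(question: str, results: list[tuple[str, float]]) -> dict:
    """Log one retrieval as a JSON line: the question plus each chunk id and score."""
    record = {
        "question": question,
        "retrieved": [{"chunk_id": cid, "score": round(score, 3)} for cid, score in results],
    }
    log.info(json.dumps(record))
    return record


# Invented chunk ids and scores, standing in for real retriever output.
record = log_retrieval(
    "What is the refund window?",
    [("policy.pdf#3", 0.8123), ("faq.md#1", 0.4)],
)
```

Reviewing these lines for failing questions shows whether the right chunk was missing, ranked too low, or retrieved but ignored.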

Knowledge Check Questions

List the 5 steps in order.
Which steps are offline vs online?
What is chunk overlap?

Quiz Questions

The first step of a RAG pipeline is: a) Embedding b) Loading c) Chunking d) Retrieval Answer: b

Challenge Task

For a 100-page PDF, propose a chunking strategy with rationale: chunk size, overlap, metadata fields, expected retrieval count.

Real-world Use Cases

All "chat with your docs" features
Internal company assistants
Code search bots

Industry Insight

A senior RAG engineer's job is mostly chunk strategy, metadata design, and retrieval evaluation. The LLM is the least-tunable part.

Interview Questions

Walk me through a RAG pipeline.
How do you choose chunk size?
What is the role of metadata in retrieval?

Summary

Load, chunk, embed, store, retrieve and generate. Memorize this and you can talk RAG with anyone.

Lesson 7.3: RAG vs fine-tuning vs long context

Hook / Why This Matters

Many of the biggest spending decisions in AI engineering start with one question: RAG, fine-tuning, or long context? Get it wrong and you waste months. Get it right and you ship in weeks.

Beginner Analogy

You need a chef who knows your favorite dishes. Three options: send them to cooking school (fine-tune), hand them a recipe card before each meal (RAG), or give them the whole cookbook every time (long context). Each has tradeoffs.

Concept Explanation

Need	Pick
Inject up-to-date or private knowledge	RAG
Change tone, format, or style	Fine-tune
Add stable, narrow behavior (JSON shape, classification)	Fine-tune or prompt
One-off use of a long doc	Long context
Lots of varied docs queried often	RAG
Bake in a domain language (medical, legal)	Fine-tune on top of RAG

Technical Breakdown

RAG: low setup cost, no model training, easy updates, ~$0.01-0.10 per query.

Fine-tune: medium setup cost, requires labeled data, hard to update, hosting cost.

Long context: zero setup, simple, expensive per query past ~50K tokens, and prone to lost-in-the-middle (the model recalls details buried mid-prompt less reliably than those near the start or end).
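
The per-query cost gap between long context and RAG is simple arithmetic on input tokens. The sketch below uses a hypothetical price of $1.00 per million input tokens and an assumed 200,000-token document; neither number is a real quote. It counts only the context sent with each query and ignores output tokens, embedding calls, and any prompt caching.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 1.00  # hypothetical USD price, not a real quote


def input_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Cost in USD of sending `tokens` input tokens once."""
    return tokens / 1_000_000 * price_per_million


long_context_tokens = 200_000  # whole document sent with every query (assumed size)
rag_tokens = 5 * 500           # top_K=5 chunks of 500 tokens each

queries = 1_000
long_total = input_cost(long_context_tokens) * queries
rag_total = input_cost(rag_tokens) * queries
ratio = long_context_tokens / rag_tokens
print(f"long context: ${long_total:.2f}  RAG: ${rag_total:.2f}  ratio: {ratio:.0f}x")
```

Under these assumptions the long-context approach sends 80 times more input per query, and that multiplier applies to every repeated query.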

The 2026 winning pattern for most enterprises: RAG first, fine-tune only when style or format must be guaranteed.

Visual Learning Suggestion

A 3-axis "decision triangle" with RAG, Fine-tune, and Long context at the corners, and example use cases plotted inside.

Interactive Element

Take 5 imagined product features. Place each on the decision triangle. Justify each placement in one sentence.

Hands-on Lab

Write a 1-page decision memo for a "chat with our HR policy" feature. Argue for RAG vs fine-tune vs long context.

Mini Exercise

Why is updating knowledge in a fine-tuned model harder than in a RAG system?

Common Mistakes

Defaulting to fine-tuning to "make the model smarter" (fine-tuning mainly shapes behavior and style; it does not reliably add new facts)
Using long context for repeated queries (paying every time for the same input)
Underestimating the engineering cost of RAG quality

Debugging Tips

If your fine-tuned model gives stale answers, the data baked into it is out of date. You need RAG or a fresh fine-tuning run on updated data.

Knowledge Check Questions

When is RAG the right call?
When is fine-tuning the right call?
When does long context win?

Quiz Questions

Up-to-date company knowledge is best served by: a) Fine-tuning b) RAG c) Long context only d) Prompt magic Answer: b

Challenge Task

Pick any real product you use. Argue what part should be RAG, what should be fine-tuned, what should be long context.

Real-world Use Cases

RAG: internal docs, news, support
Fine-tune: tone-of-voice, structured outputs at scale
Long context: one-off legal doc review

Industry Insight

The 2026 pattern: layered. Fine-tune for style, RAG for knowledge, long context for one-offs. Senior engineers compose all three.

Interview Questions

Compare RAG, fine-tuning, and long context.
When would you combine them?
Why is fine-tuning bad at "learning facts"?

Summary

RAG for knowledge, fine-tune for style, long context for one-offs. Combine when you must, but when the problem is missing knowledge, start with RAG.

Lesson 7.4: Where RAG shines, and where it fails

Hook / Why This Matters

Many failed RAG projects in 2026 used RAG for the wrong task. This lesson is the screening test.

Beginner Analogy

A search engine plus a writer is great for "tell me what these docs say". Less great for "predict next quarter's revenue".

Concept Explanation

RAG shines when:

The answer is somewhere in your corpus, stated more or less directly
Questions are specific
Sources are text-heavy and well-structured
Updates are frequent
Answers must cite sources

RAG fails when:

The answer requires synthesis across hundreds of pages
Questions are open-ended ("what should we do?")
The corpus is mostly images or audio (use multimodal retrieval)
The data is fragmented (each chunk lacks self-contained context)
The user expects creativity, not retrieval

Technical Breakdown

A failure pattern: chunks that lack headers or context. A chunk that says "It is 12.5%" with no context is useless. Always preserve metadata (document title, section, date) on each chunk so the retrieved snippet is self-contained.
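
One way to keep a chunk self-contained is to store its metadata alongside the text and render both together when building the prompt. This is a sketch under assumptions: the Chunk class, its field names, and the bracketed header format are hypothetical, and the handbook title and section are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    title: str    # document title
    section: str  # section heading the chunk came from
    date: str     # ISO date string, e.g. "2026-01-15"

    def as_context(self) -> str:
        """Render the chunk with its metadata so it reads as self-contained."""
        return f"[{self.title} | {self.section} | {self.date}]\n{self.text}"


# Hypothetical document metadata, invented for illustration.
chunk = Chunk(
    text="It is 12.5%.",
    title="Employee Handbook",
    section="Bonus rate",
    date="2026-01-15",
)
print(chunk.as_context())
```

With the header attached, "It is 12.5%." becomes answerable: the reader and the model can both see which document and section the figure belongs to.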

Visual Learning Suggestion

A two-column table "RAG shines" vs "RAG fails" with 6 examples each.

Interactive Element

Pick 5 questions about your own notes. Identify which would and would not be answerable by RAG and why.

Hands-on Lab

Take 2 documents you wrote. Write 3 questions for each. Predict whether RAG would answer each one well. Once you build the RAG system in Module 9, come back and check.

Mini Exercise

Why is "what should we do?" a bad RAG question?

Common Mistakes

Using RAG for predictive or judgment questions
Letting chunks drop their source metadata
Hoping the LLM "synthesizes" facts that are not in any retrieved chunk

Debugging Tips

If retrieval looks good but answers are wrong, your chunks are likely missing context. Add titles, dates, and section headers to every chunk.

Knowledge Check Questions

Name 3 RAG-suited tasks.
Name 3 RAG-unsuited tasks.
Why does chunk metadata matter?

Quiz Questions

RAG is a poor fit for: a) "What does our refund policy say?" b) "Summarize this contract." c) "Predict next quarter's revenue." d) "What did we ship last week?" Answer: c

Challenge Task

Audit a real "AI chatbot" you have used. Identify 3 questions it answered well and 3 it dodged or hallucinated. Diagnose which were RAG-mismatched.

Real-world Use Cases

Good: customer support, internal docs, code search
Bad: open-ended brainstorming, forecasting, creative writing

Industry Insight

Saying no to a wrong RAG use case earns more trust than building it badly. Build the muscle of screening.

Interview Questions

Give an example of a question RAG cannot answer.
How do you preserve source context in chunks?
How do you evaluate RAG quality?

Summary

RAG shines on specific, source-grounded questions. It fails on judgment, prediction, and broad synthesis across hundreds of pages. Screen ruthlessly.

Lesson 7.5: Sketching a production RAG architecture

Hook / Why This Matters

A whiteboard sketch is what gets your RAG idea funded. This lesson hands you the canonical 2026 architecture.

Beginner Analogy

The architecture is the blueprint. Even if you do not pour the foundation yourself, you must be able to draw the house.

Concept Explanation

A canonical 2026 production RAG architecture has:

Ingestion service: pulls and parses sources (PDFs, web scrapes, DB exports).
Chunker: splits documents with metadata.
Embedder: calls an embedding API.
Vector DB: stores vectors + metadata (Pinecone, Weaviate, Chroma, Qdrant, pgvector).
Hybrid search: combines vector similarity with keyword/BM25 search.
Reranker: re-orders top results with a small cross-encoder model.
LLM generator: produces the final answer.
Eval and feedback loop: logs queries, retrievals, answers, and user thumbs-up/down ratings.

Skip hybrid search and the reranker (the fifth and sixth components above) for v1. Add them when retrieval quality matters.

Technical Breakdown

A minimal v1 you can ship this month: Loader -> Chunker -> OpenAI embeddings -> ChromaDB -> top-K cosine similarity search -> GPT-4o-mini.

A v2 production system: add hybrid search, reranking (Cohere or Voyage reranker), evaluation harness, and a cache.
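
Hybrid search needs some way to merge the vector ranking with the keyword ranking. One common option is reciprocal rank fusion (RRF), sketched below; it is named here as an example technique, not as the method this module prescribes. The chunk ids are invented, and k=60 is the constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked id lists; each list contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Invented chunk ids: one ranking from vector search, one from keyword/BM25 search.
vector_hits = ["c3", "c1", "c7"]
keyword_hits = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Chunks that appear in both rankings rise to the top, which is the point of combining the two searches. A reranker would then re-order the head of this fused list.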

Visual Learning Suggestion

A boxes-and-arrows architecture diagram with 8 boxes for v2 and a "v1" subset highlighted.

Interactive Element

Pick a real problem (e.g., chat with the GeekHub docs). Sketch v1 and v2. Mark which components you would ship first.

Hands-on Lab

Draw the architecture on paper. Photograph it. Save it for Module 9 where you build the real thing.

Mini Exercise

Why is a reranker useful even when top-K retrieval is "good enough" already?

Common Mistakes

Designing v2 before v1 is shipped
Skipping the eval and feedback loop (no way to improve)
Picking a vector DB before you know your data shape
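
A first evaluation metric for the feedback loop can be as small as hit rate at K: for a set of test questions with a known correct chunk, how often does that chunk show up in the top K results? The sketch below is one possible starting point; the function name, the data shapes, and the question and chunk ids are all hypothetical.

```python
def hit_rate_at_k(retrieved: dict[str, list[str]], expected: dict[str, str], k: int = 5) -> float:
    """Fraction of questions whose expected chunk id appears in the top-k retrieved ids."""
    if not expected:
        return 0.0
    hits = sum(1 for q, gold in expected.items() if gold in retrieved.get(q, [])[:k])
    return hits / len(expected)


# Invented evaluation set: question id -> the chunk id that holds the answer.
expected = {"q1": "c1", "q2": "c5", "q3": "c9"}
# Invented retriever output: question id -> ranked chunk ids.
retrieved = {"q1": ["c1", "c2"], "q2": ["c3", "c4"], "q3": ["c8", "c9"]}

rate = hit_rate_at_k(retrieved, expected, k=2)
print(f"hit rate@2: {rate:.2f}")
```

Tracking this number before and after each change to chunking or retrieval tells you whether the change helped.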

Debugging Tips

If you cannot draw the architecture in 5 minutes, you do not understand the system well enough to build it.

Knowledge Check Questions

Name 8 components of a production RAG system.
Which can you skip in v1?
What does a reranker do?

Quiz Questions

A reranker is typically: a) A bigger LLM b) A small cross-encoder that re-orders top-K results c) A vector DB d) An embedding model Answer: b

Challenge Task

Pick a real product idea. Draw v1 and v2 architectures. Estimate cost per query for each.

Real-world Use Cases

Internal Q&A assistants
Public-facing docs bots
Codebase-aware coding assistants

Industry Insight

One of the most underrated skills is being able to sketch a clear architecture in a meeting. It is often a large part of the senior engineer interview at AI companies.

Interview Questions

Sketch a production RAG architecture.
What is the role of a reranker?
How would you evaluate RAG quality?

Summary

8 components, v1 to v2 progression, sketchable in 5 minutes. You are now RAG-architecture literate.

Module 7 Recap

You understand RAG end-to-end without writing code. You can recite the pipeline, compare it to alternatives, screen for the right use case, and sketch the architecture. Module 8 makes you fluent in embeddings, the heart of step 3 (Embed).

Next Module

Module 8: Vector Embeddings Simplified