Lesson 9.7: Evaluation: knowing when RAG is "good enough" | GeekHub Learn

Without an eval set, you ship vibes. With one, you ship products. This lesson hands you the minimum viable eval.

A chef who never tastes their food cannot improve. A RAG engineer who never evaluates their app is in the same position.

Build a 20-question eval set with known answers and expected source pages. After every change, re-run the set and compare the results with the previous run. Track three metrics:

Answer correctness (manual or LLM-judge)
Source correctness (did it cite the right pages?)
Faithfulness (did it stick to the sources, or hallucinate?)

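import json
import re

# Each case: {"q": question, "expected_page": page number, "expected_kw": keyword}
with open("eval.json", encoding="utf-8") as f:
    EVAL = json.load(f)

for case in EVAL:
    # answer() is your app's RAG function; the join assumes it yields text chunks
    out = "".join(answer(case["q"]))
    # (?!\d) stops "p.1" from also matching "p.12"
    citation_ok = re.search(rf"p\.{case['expected_page']}(?!\d)", out) is not None
    keyword_ok = case["expected_kw"].lower() in out.lower()
    print(case["q"], citation_ok, keyword_ok)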

For deeper eval, use Ragas, TruLens, or Promptfoo.

Visualize it

A table of eval results with green/red dots per case and a final score.
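One way to get that table without any extra tooling is a plain-text report. The sketch below is only an illustration: the function name `render_report` and the result shape (a list of dicts with `q`, `citation_ok`, and `keyword_ok`) are assumptions made here, chosen to match the two checks from the eval loop.

```python
def render_report(results):
    """Return a plain-text table: one row per case, then a final score.

    `results` is a list of dicts with keys q, citation_ok, keyword_ok
    (a hypothetical shape chosen for this sketch).
    """
    mark = {True: "🟢", False: "🔴"}
    lines = [f"{'question':<40} cite  kw"]
    passed = 0
    for r in results:
        # A case counts toward the score only if both checks pass.
        passed += r["citation_ok"] and r["keyword_ok"]
        lines.append(
            f"{r['q'][:40]:<40} {mark[r['citation_ok']]}    {mark[r['keyword_ok']]}"
        )
    lines.append(f"score: {passed}/{len(results)} cases passed both checks")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        {"q": "What is the refund window?", "citation_ok": True, "keyword_ok": True},
        {"q": "Who approves exceptions?", "citation_ok": True, "keyword_ok": False},
    ]
    print(render_report(demo))
```

Counting a case as passed only when both checks are green is a deliberately strict choice. You could instead report a separate score per metric.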

Try it now

Build 20 questions about a PDF you indexed. Score your current app.

Hands-on lab

Build the eval script. Iterate one improvement (different chunk size, larger K, different LLM). Re-run. Compare.
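For the "compare" step, it helps to look at which cases flipped, not only at the total. The following is a minimal sketch under stated assumptions: `compare_runs` is a hypothetical helper, and each run is assumed to be a dict mapping a question to a pass/fail boolean.

```python
def compare_runs(baseline, candidate):
    """Diff two eval runs keyed by question.

    Each run maps question -> bool (did the case pass?). This shape is an
    assumption for the sketch, not something the lesson prescribes.
    """
    fixed = sorted(q for q in baseline if not baseline[q] and candidate.get(q, False))
    broken = sorted(q for q in baseline if baseline[q] and not candidate.get(q, False))
    return {
        "baseline_score": sum(baseline.values()),
        "candidate_score": sum(candidate.values()),
        "fixed": fixed,    # failed before, pass now
        "broken": broken,  # passed before, fail now
    }


if __name__ == "__main__":
    before = {"q1": True, "q2": False, "q3": True}
    after = {"q1": True, "q2": True, "q3": False}
    print(compare_runs(before, after))
```

In the demo, both runs score 2 out of 3, yet one case was fixed and another broke. A total alone would hide that regression.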

Try it now

Why is "faithfulness" measured separately from "correctness"?

Common mistakes

Skipping eval ("looks fine to me" is not a metric)
Running eval only once after bundling several changes (you cannot tell which change caused the improvement)
Letting evals get stale (add new failure cases as you find them)

Debugging tip

When users report a bad answer, add that question to the eval set immediately. Your set grows with your product.
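Adding a reported failure can be a one-line call if the eval file is plain JSON. This sketch assumes the same `eval.json` shape used by the eval loop (`q`, `expected_page`, `expected_kw`). The helper name `add_case` is made up for illustration.

```python
import json
from pathlib import Path


def add_case(path, q, expected_page, expected_kw):
    """Append a case to the eval file unless the question is already there.

    Returns True if the case was added, False if it was a duplicate.
    """
    p = Path(path)
    cases = json.loads(p.read_text(encoding="utf-8")) if p.exists() else []
    if any(c["q"] == q for c in cases):
        return False
    cases.append({"q": q, "expected_page": expected_page, "expected_kw": expected_kw})
    p.write_text(json.dumps(cases, indent=2, ensure_ascii=False), encoding="utf-8")
    return True
```

For example, `add_case("eval.json", "What is the notice period?", 14, "30 days")` would record a user-reported miss. The question, page, and keyword in that call are invented placeholders.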

Challenge

Build an LLM-as-judge that grades faithfulness on a 1-5 scale for 20 answers.
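As a possible starting point, here is a sketch of the judge's plumbing. It does not call any real model: `call_llm` is a hypothetical function you supply (prompt string in, reply string out), and the prompt wording and the names `JUDGE_PROMPT`, `parse_score`, and `judge_faithfulness` are assumptions, not a recommended standard.

```python
import re

# Hypothetical prompt wording; tune it for your own model.
JUDGE_PROMPT = (
    "You are grading a RAG answer for faithfulness.\n"
    "Sources:\n{sources}\n\n"
    "Answer:\n{answer}\n\n"
    "On a scale of 1 (mostly unsupported) to 5 (fully supported by the sources), "
    "reply with a single digit."
)


def parse_score(reply):
    """Extract the first standalone digit 1-5 from the judge's reply, else None."""
    m = re.search(r"\b([1-5])\b", reply)
    return int(m.group(1)) if m else None


def judge_faithfulness(sources, answer_text, call_llm):
    """Grade one answer. `call_llm` is a caller-supplied function: str -> str."""
    reply = call_llm(JUDGE_PROMPT.format(sources=sources, answer=answer_text))
    return parse_score(reply)


if __name__ == "__main__":
    # Stand-in for a real model call, so the sketch runs offline.
    fake_llm = lambda prompt: "Score: 4"
    print(judge_faithfulness("p.3: Refunds within 30 days.", "30 days (p.3)", fake_llm))
```

Returning `None` for an unparseable reply keeps bad judge output visible instead of silently counting it as a score.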

Where this shows up

Pre-launch checks
Model upgrade regression
Continuous quality monitoring

From the field

The fastest RAG career growth happens to engineers who own eval. Without it, all "improvements" are guesses.

Recap

20-question eval set, three metrics, iterate one change at a time. This is what separates engineers from prompt-tinkerers.

Quick recall

Name 3 RAG eval metrics.

Quiz time

1. The most important RAG eval metric for trust is