Evaluation: knowing when RAG is "good enough"
Without an eval set, you ship vibes. With one, you ship products. This lesson hands you the minimum viable eval.
A chef who never tastes their food cannot improve. A RAG engineer who never evaluates their app is in the same position.
Build a 20-question eval set with known answers and expected source pages. After every change, re-run it and compare the results with the previous run. Track three metrics:
- Answer correctness (manual or LLM-judge)
- Source correctness (did it cite the right pages?)
- Faithfulness (did it stick to the sources, or hallucinate?)
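The eval set itself can be a small JSON file. Here is a minimal sketch, assuming the field names q, expected_page, and expected_kw that this lesson's loop reads. The two sample cases and the validate_case and save_eval_set helpers are hypothetical placeholders; replace the cases with questions about your own documents.

```python
import json

# Hypothetical sample cases; swap in real questions about your own documents.
SAMPLE_CASES = [
    {"q": "What is the refund window?", "expected_page": 12, "expected_kw": "30 days"},
    {"q": "Who approves travel expenses?", "expected_page": 4, "expected_kw": "line manager"},
]

REQUIRED_FIELDS = ("q", "expected_page", "expected_kw")


def validate_case(case):
    """Return a list of problems found in one eval case (empty if it is fine)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in case]
    if not str(case.get("q", "")).strip():
        problems.append("empty question")
    return problems


def save_eval_set(cases, path="eval.json"):
    """Validate every case, then write the whole set to disk as JSON."""
    for i, case in enumerate(cases):
        problems = validate_case(case)
        if problems:
            raise ValueError(f"case {i}: {problems}")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cases, f, indent=2)
```

Calling save_eval_set(SAMPLE_CASES) writes eval.json in the shape the loop expects. Validating up front means a typo in a field name fails loudly here instead of as a KeyError halfway through a run.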
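import json
import re

# eval.json: a list of {"q", "expected_page", "expected_kw"} objects
with open("eval.json") as f:
    EVAL = json.load(f)

for case in EVAL:
    # answer() is your RAG app's function; it is assumed to yield text chunks
    out = "".join(answer(case["q"]))
    # (?!\d) stops "p.1" from also matching "p.12"
    citation_ok = re.search(rf"p\.{case['expected_page']}(?!\d)", out) is not None
    keyword_ok = case["expected_kw"].lower() in out.lower()
    print(case["q"], citation_ok, keyword_ok)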
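To make the re-run-and-compare step concrete, the per-case results can be rolled up into pass rates and checked against the previous run. This is a sketch under assumptions: the summarize and compare helpers and the shape of the result dictionaries are illustrative names, not part of the lesson's code, and the results are passed in rather than computed so the block stands on its own.

```python
def summarize(results):
    """Turn per-case booleans into pass rates between 0.0 and 1.0.

    `results` is a list of dicts like {"citation_ok": bool, "keyword_ok": bool}.
    """
    total = len(results)
    if total == 0:
        return {"citation_rate": 0.0, "keyword_rate": 0.0}
    return {
        "citation_rate": sum(r["citation_ok"] for r in results) / total,
        "keyword_rate": sum(r["keyword_ok"] for r in results) / total,
    }


def compare(baseline, current):
    """Return only the metrics that got worse, mapped to (old, new) rates."""
    return {
        name: (baseline[name], current[name])
        for name in baseline
        if current[name] < baseline[name]
    }


# Illustrative usage with made-up results from two runs.
before = summarize([
    {"citation_ok": True, "keyword_ok": True},
    {"citation_ok": True, "keyword_ok": False},
])
after = summarize([
    {"citation_ok": False, "keyword_ok": True},
    {"citation_ok": True, "keyword_ok": False},
])
regressions = compare(before, after)
```

An empty regressions dictionary means nothing got worse. Saving each summary to disk (for example with json.dump) gives you a baseline to compare the next change against.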
For deeper eval, use Ragas, TruLens, or Promptfoo.
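Faithfulness is the one metric the loop above does not touch. Before adopting one of these tools, a crude lexical proxy can flag obviously unsupported answers: the share of answer sentences whose content words mostly appear in the retrieved text. This is a swapped-in heuristic, not what Ragas, TruLens, or Promptfoo compute. The support_ratio function, the 0.6 threshold, and the short stop-word list are all assumptions made for illustration.

```python
import re

# Tiny, illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "of", "to", "in", "and", "or", "it", "for", "on"}


def content_words(text):
    """Lowercase word tokens with stop words removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}


def support_ratio(answer_text, source_text, threshold=0.6):
    """Fraction of answer sentences whose content words mostly occur in the sources.

    A sentence counts as supported when at least `threshold` of its content
    words appear somewhere in `source_text`. This is purely lexical: a
    paraphrase can be marked unsupported, and copied words can hide a wrong claim.
    """
    source_vocab = content_words(source_text)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer_text.strip()) if s]
    scored = []
    for sentence in sentences:
        words = content_words(sentence)
        if not words:
            continue  # nothing checkable in this sentence
        overlap = len(words & source_vocab) / len(words)
        scored.append(overlap >= threshold)
    return sum(scored) / len(scored) if scored else 0.0
```

Treat a low ratio as a prompt to read the answer yourself, not as a verdict. Word overlap cannot tell whether the answer draws a conclusion the sources do not support.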