GeekHub Learn
Module 9: Building a PDF Chatbot (RAG Project)
Lesson 9.7 · 7 of 8 in this module · 2 min read

Evaluation: knowing when RAG is "good enough"

Without an eval set, you ship vibes. With one, you ship products. This lesson hands you the minimum viable eval.

A chef who never tastes their food cannot improve it. A RAG engineer who never evaluates their app is in the same position.

Build a 20-question eval set with known answers and expected source pages. After every change, re-run the set and compare the results with the previous run. Track three metrics:

  • Answer correctness (manual or LLM-judge)
  • Source correctness (did it cite the right pages?)
  • Faithfulness (did it stick to the sources, or hallucinate?)
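
import json
import re

# eval.json: list of {"q": ..., "expected_page": ..., "expected_kw": ...}
with open("eval.json", encoding="utf-8") as f:
    EVAL = json.load(f)

for case in EVAL:
    # answer() is your app's RAG function; it is assumed to yield text chunks
    out = "".join(answer(case["q"]))
    # \b stops page 1 from matching a citation such as "p.12"
    citation_ok = re.search(rf"p\.{case['expected_page']}\b", out) is not None
    keyword_ok = case["expected_kw"].lower() in out.lower()
    print(case["q"], citation_ok, keyword_ok)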
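
The script above reads its cases from eval.json. A minimal sketch of how that file could be created, assuming the three field names the script uses (q, expected_page, expected_kw). The questions, pages, and keywords below are hypothetical placeholders, and save_cases is an illustrative helper, not part of the lesson's app:

```python
import json

# Hypothetical cases: replace with questions about your own indexed PDF.
CASES = [
    {"q": "What is the refund window?", "expected_page": 12, "expected_kw": "30 days"},
    {"q": "Who approves travel expenses?", "expected_page": 4, "expected_kw": "line manager"},
]

def save_cases(cases, path="eval.json"):
    """Check that every case has the three expected fields, then write the file."""
    required = {"q", "expected_page", "expected_kw"}
    for i, case in enumerate(cases):
        missing = required - set(case)
        if missing:
            raise ValueError(f"case {i} is missing fields: {sorted(missing)}")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cases, f, indent=2)
    return len(cases)
```

Validating the fields at write time means a malformed case fails here rather than halfway through an eval run.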

For deeper eval, use Ragas, TruLens, or Promptfoo.

Visualize it

A table of eval results with green/red dots per case and a final score.
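
One way to produce such a table in a terminal, as a rough sketch. It assumes you collect the three values the eval loop prints into (question, citation_ok, keyword_ok) tuples. PASS/FAIL stands in for the green/red dots, and counting a case only when both checks pass is an assumption you may want to change:

```python
def render_report(results):
    """Return a plain-text table: one row per case plus a final score.

    `results` is a list of (question, citation_ok, keyword_ok) tuples.
    """
    mark = {True: "PASS", False: "FAIL"}
    lines = [f"{'question':<40} {'cite':<5} {'kw':<5}"]
    passed = 0
    for q, citation_ok, keyword_ok in results:
        lines.append(f"{q[:40]:<40} {mark[citation_ok]:<5} {mark[keyword_ok]:<5}")
        # Assumption: a case counts toward the score only if both checks pass.
        passed += int(citation_ok and keyword_ok)
    lines.append(f"score: {passed}/{len(results)}")
    return "\n".join(lines)
```

Printing render_report(results) after each run gives you one line to compare across changes: the final score.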

Try it now

Build 20 questions about a PDF you indexed. Score your current app.

Hands-on lab

Build the eval script. Make one improvement (a different chunk size, a larger K, or a different LLM), re-run the eval, and compare the scores with the previous run.
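
Comparing runs is easier per case than by total score. A minimal sketch, assuming each run is stored as a dict that maps a question to True (passed) or False (failed); compare_runs is a hypothetical helper name:

```python
def compare_runs(before, after):
    """Diff two eval runs keyed by question; each value is True if the case passed."""
    # Cases that failed before and pass now.
    fixed = [q for q in before if not before[q] and after.get(q, False)]
    # Cases that passed before and fail now (regressions).
    broken = [q for q in before if before[q] and not after.get(q, False)]
    return {
        "before": sum(before.values()),
        "after": sum(after.values()),
        "fixed": fixed,
        "broken": broken,
    }
```

The per-case lists matter because two runs can have the same total while one case was fixed and another broke.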

Try it now

Why is "faithfulness" measured separately from "correctness"?

Common mistakes

  • Skipping eval ("looks fine to me" is not a metric)
  • Evaluating only once, after bundling several changes together (you cannot tell which change caused the improvement)
  • Letting evals get stale (add new failure cases as you find them)

Debugging tip

When users report a bad answer, add that question to the eval set immediately. Your set grows with your product.
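
A small sketch of that habit, assuming the eval set lives in a JSON file with the q, expected_page, and expected_kw fields used by the eval script. add_case is a hypothetical helper, and you still have to look up the correct page and keyword yourself:

```python
import json
import os

def add_case(path, q, expected_page, expected_kw):
    """Append a case to the eval file unless the same question is already there."""
    cases = []
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            cases = json.load(f)
    # Compare questions case-insensitively so repeat reports are not duplicated.
    if any(c["q"].strip().lower() == q.strip().lower() for c in cases):
        return False
    cases.append({"q": q, "expected_page": expected_page, "expected_kw": expected_kw})
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cases, f, indent=2)
    return True
```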

Challenge

Build an LLM-as-judge that grades faithfulness on a 1-5 scale for 20 answers.
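
A possible starting skeleton for the challenge, not a finished judge. It assumes you pass in call_llm, a hypothetical function that takes a prompt string and returns the model's reply as a string; wire it to whatever model client your app already uses. The prompt wording is only an illustration:

```python
import re

JUDGE_PROMPT = """You are grading a RAG answer for faithfulness.
Score 1 (mostly unsupported by the sources) to 5 (every claim is supported).
Reply with the score only.

Sources:
{sources}

Answer:
{answer}
"""

def grade_faithfulness(answer_text, sources, call_llm):
    """Return an int from 1 to 5, or None if the reply has no usable score.

    call_llm is a hypothetical callable: (prompt: str) -> str.
    """
    prompt = JUDGE_PROMPT.format(sources="\n---\n".join(sources), answer=answer_text)
    reply = call_llm(prompt)
    # Take the first standalone digit 1-5; this tolerates replies like "Score: 4/5".
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None
```

Returning None for unparseable replies keeps a chatty judge from silently skewing the average. Count those cases separately when you grade your 20 answers.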

Where this shows up

  • Pre-launch checks
  • Model upgrade regression
  • Continuous quality monitoring

From the field

The fastest RAG career growth happens to engineers who own eval. Without it, all "improvements" are guesses.

Recap

20-question eval set, three metrics, iterate one change at a time. This is what separates engineers from prompt-tinkerers.


Quick recall

Name 3 RAG eval metrics.

Quiz time

  1. The most important RAG eval metric for trust is
