Sketching a production RAG architecture
A whiteboard sketch is what gets your RAG idea funded. This lesson hands you the canonical 2026 architecture.
The architecture is the blueprint. Even if you do not pour the foundation yourself, you must be able to draw the house.
A canonical 2026 production RAG architecture has:
- Ingestion service: pulls and parses sources (PDFs, web scrapes, DB exports).
- Chunker: splits documents with metadata.
- Embedder: calls an embedding API.
- Vector DB: stores vectors + metadata (Pinecone, Weaviate, Chroma, Qdrant, pgvector).
- Hybrid search: combines vector similarity with keyword/BM25 search.
- Reranker: re-orders top results with a small cross-encoder model.
- LLM generator: produces the final answer.
- Eval and feedback loop: logs queries, retrievals, answers, and user thumbs-up/down ratings.
Skip hybrid search and the reranker (items 5 and 6 in the list) for v1. Add them once plain vector search stops returning the right chunks.
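To make the chunker item concrete, here is a minimal sketch of splitting a document into overlapping word windows and attaching metadata to each chunk. The names `Chunk` and `chunk_document`, the metadata keys, and the default window sizes are illustrative assumptions, not part of any particular library; production chunkers often split on tokens or document structure instead of whitespace.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_document(text, source, size=200, overlap=40):
    """Split text into overlapping word windows, tagging each with metadata.

    Hypothetical helper: `size` and `overlap` are counted in whitespace-separated words.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        chunks.append(Chunk(
            text=" ".join(window),
            # Metadata travels with the chunk so answers can cite their source later.
            metadata={"source": source, "chunk_index": len(chunks), "start_word": start},
        ))
        if start + size >= len(words):
            break  # this window already reached the end of the document
    return chunks
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of storing some words twice.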
A minimal v1 you can ship this month: Loader -> Chunker -> OpenAI embeddings -> ChromaDB -> top-K cosine -> GPT-4o-mini.
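The retrieval half of that v1 pipeline can be sketched without any external services. In the sketch below, `toy_embed` is a stand-in for the embedding API call and `InMemoryIndex` is a stand-in for the vector DB; both are hypothetical helpers written for illustration, and neither reflects the OpenAI or ChromaDB client APIs. The final prompt string is what you would send to the generator model.

```python
import hashlib
import math

DIM = 256  # assumed toy dimensionality; real embedding models use far more


def toy_embed(text):
    """Stand-in for an embedding API: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


class InMemoryIndex:
    """Stand-in for a vector DB: stores (vector, text, metadata) rows in a list."""

    def __init__(self):
        self.rows = []

    def add(self, text, metadata=None):
        self.rows.append((toy_embed(text), text, metadata or {}))

    def top_k(self, query, k=3):
        q = toy_embed(query)
        scored = [(cosine(q, vec), text) for vec, text, _ in self.rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def build_prompt(question, hits):
    """Assemble retrieved chunks and the question into one generator prompt."""
    context = "\n".join(f"- {text}" for _, text in hits)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

Swapping the stand-ins for a real embedding call and a real vector store should leave the overall shape unchanged: embed, store, embed the query, take the top K by cosine similarity, then prompt.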
A v2 production system adds hybrid search, reranking (Cohere or Voyage reranker), an evaluation harness, and a cache.
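One common way to combine a vector ranking with a keyword/BM25 ranking is reciprocal rank fusion (RRF). The lesson does not say which fusion method to use, so treat this as one reasonable option, not the prescribed one. The function name and the constant `k=60` (a conventional default) are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over every list it appears in,
    so items ranked well by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["a", "b", "c"]   # hypothetical IDs from vector similarity search
keyword_hits = ["b", "d", "a"]  # hypothetical IDs from keyword/BM25 search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# fused == ["b", "a", "d", "c"]
```

RRF only needs ranks, not raw scores, which sidesteps the problem that cosine similarities and BM25 scores live on different scales. The fused list is what you would pass to the reranker.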
Quick recall
Name 8 components of a production RAG system.