A RAG (Retrieval-Augmented Generation) chatbot answers questions by combining two systems: retrieval (find the best knowledge-base passages) and generation (use an LLM to write a response grounded in those passages). If retrieval is good, the model has evidence—so hallucinations drop and answers become auditable.
This guide walks you through a minimal, practical RAG setup with open-source tools: document ingestion, chunking, embeddings, vector storage, retrieval, citations, follow-ups, and evaluation.
RAG success rule
If retrieval is weak, the LLM cannot save you. Invest first in clean ingestion, chunking, metadata, and retrieval tuning.
1. What RAG Is (And Why It Works)
- RAG retrieves: semantic search finds relevant chunks by meaning, not exact keywords.
- RAG grounds: the LLM is instructed to answer using only the provided context.
- RAG adds traceability: citations point to the exact sources used.
- RAG updates easily: you can refresh the index without retraining a model.
2. Minimal RAG Architecture
Indexing: Documents → Clean + Chunk → Embeddings → Vector Store
Query: User question → Embed query → Retrieve top-k chunks (+ optional rerank)
Generation: LLM prompt (question + retrieved chunks + rules) → Grounded answer + citations
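At query time, those steps compose into one small function. A conceptual sketch only; the `retrieve` and `generate_answer` helpers are placeholders that the later steps flesh out:

```python
def answer_question(question: str, top_k: int = 5) -> dict:
    # 1) Retrieve: embed the question and find the most similar chunks.
    chunks = retrieve(question, top_k=top_k)
    # 2) Generate: ask the LLM to answer only from those chunks.
    answer = generate_answer(question, chunks)
    return {
        "answer": answer,
        "citations": [c["chunk_id"] for c in chunks],  # traceability back to sources
    }
```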
3. Open-Source Stack Options
- LLM (local): Ollama (simple local model runner)
- Embeddings: SentenceTransformers (lightweight, reliable)
- Vector store: Qdrant (service), Chroma (simple), FAISS (embedded index)
- App layer: Python + FastAPI/Flask; UI via simple web page or Streamlit
Minimal “works today” stack
Ollama (LLM) + SentenceTransformers (embeddings) + Qdrant (vector DB) + a small FastAPI service.
4. Step 1: Ingest and Normalize Documents
Good ingestion is mostly about consistency:
- Extract text cleanly: remove navigation, repeated headers/footers.
- Normalize: whitespace, bullets, and broken lines.
- Preserve structure: headings, section titles, and source URLs.
- Track metadata: doc_id, title, section, path, updated_at.
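A rough normalization pass in Python might look like the sketch below; the regexes and metadata fields are illustrative starting points, not a fixed schema:

```python
import re
from datetime import datetime, timezone

def normalize_text(raw: str) -> str:
    """Collapse whitespace and rejoin lines broken mid-sentence."""
    text = raw.replace("\r\n", "\n")
    text = re.sub(r"-\n(\w)", r"\1", text)           # rejoin hyphenated line breaks
    text = re.sub(r"(?<![.\n])\n(?!\n)", " ", text)  # single newlines inside paragraphs -> spaces
    text = re.sub(r"[ \t]+", " ", text)              # collapse runs of spaces/tabs
    return text.strip()

def make_document(doc_id: str, title: str, section: str, path: str, raw: str) -> dict:
    """Attach the metadata needed later for filters and citations."""
    return {
        "doc_id": doc_id,
        "title": title,
        "section": section,
        "path": path,
        "updated_at": datetime.now(timezone.utc).isoformat(),
        "text": normalize_text(raw),
    }
```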
5. Step 2: Chunking Strategy (Settings That Matter)
Chunking is the #1 RAG lever. Start with:
- Chunk size: ~300–600 tokens
- Overlap: ~10–20%
- Split rules: prefer heading/paragraph boundaries over raw character splits
- Keep IDs stable: chunk_id should be deterministic for updates
Chunking pitfall
Chunks that are too large reduce retrieval precision; chunks that are too small lose context and force the model to stitch answers together from fragments. Tune for your content.
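A simple heading-aware chunker under those starting values, assuming markdown-style headings survive ingestion and approximating tokens with words (both are assumptions, not requirements):

```python
import hashlib
import re

def chunk_document(doc: dict, max_words: int = 400, overlap_words: int = 60) -> list[dict]:
    """Split on headings first, then window long sections with overlap."""
    chunks = []
    # Split before markdown-style headings so chunks follow document structure.
    sections = re.split(r"\n(?=#{1,6} )", doc["text"])
    for section in sections:
        words = section.split()
        step = max(max_words - overlap_words, 1)
        for start in range(0, len(words), step):
            piece = " ".join(words[start:start + max_words])
            if not piece:
                continue
            # Deterministic chunk_id so re-ingesting an updated doc keeps stable IDs.
            chunk_id = hashlib.sha1(f"{doc['doc_id']}:{start}".encode()).hexdigest()[:16]
            chunks.append({"chunk_id": chunk_id, "doc_id": doc["doc_id"], "text": piece})
            if start + max_words >= len(words):
                break
    return chunks
```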
6. Step 3: Create Embeddings
- Use one embedding model consistently for documents and queries.
- Store raw text + metadata alongside the vector.
- Batch processing improves speed for large ingests.
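With SentenceTransformers this is only a few lines; the model name below is a common lightweight choice, not a requirement:

```python
from sentence_transformers import SentenceTransformer

# Load once at startup; the same model must embed both documents and queries.
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_chunks(chunks: list[dict], batch_size: int = 64) -> list[dict]:
    """Embed chunk texts in batches and keep text + metadata next to each vector."""
    texts = [c["text"] for c in chunks]
    vectors = model.encode(texts, batch_size=batch_size, show_progress_bar=True)
    return [{**chunk, "vector": vector.tolist()} for chunk, vector in zip(chunks, vectors)]
```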
7. Step 4: Store Vectors + Metadata
Your vector store record should include:
- vector: embedding
- payload: chunk_text, doc_title, section, source_url, tags, updated_at
- ids: doc_id, chunk_id
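With Qdrant, for example, each record becomes a point whose payload carries the chunk text and metadata. A sketch where the collection name, vector size, and payload fields are illustrative and assume the chunk dicts from the earlier steps:

```python
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Vector size must match the embedding model (384 for all-MiniLM-L6-v2).
client.recreate_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

def upsert_chunks(embedded_chunks: list[dict]) -> None:
    points = [
        PointStruct(
            # Qdrant point IDs must be integers or UUIDs, so derive a
            # deterministic UUID from the chunk_id.
            id=str(uuid.uuid5(uuid.NAMESPACE_URL, chunk["chunk_id"])),
            vector=chunk["vector"],
            payload={
                "chunk_text": chunk["text"],
                "chunk_id": chunk["chunk_id"],
                "doc_id": chunk["doc_id"],
                "doc_title": chunk.get("title", ""),
                "section": chunk.get("section", ""),
                "source_url": chunk.get("source_url", ""),
                "tags": chunk.get("tags", []),
                "updated_at": chunk.get("updated_at", ""),
            },
        )
        for chunk in embedded_chunks
    ]
    client.upsert(collection_name="docs", points=points)
```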
8. Step 5: Retrieval (Top-K, Filters, Rerank)
- Top-k: start by retrieving 5–10 chunks
- Filters: restrict by tag, product, date, or doc source
- Reranking: optional but high impact for precision (especially in noisy corpora)
- Diversity: avoid returning 10 near-duplicate chunks from one section
Easy retrieval win
Add metadata filters (category/tag) and a reranker before changing the LLM.
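A filtered top-k search against that collection, reusing the embedding model and client from the sketches above (the tag filter is just one example of a metadata filter):

```python
from qdrant_client.models import FieldCondition, Filter, MatchValue

def retrieve(question: str, top_k: int = 5, tag: str | None = None) -> list[dict]:
    """Embed the query, optionally filter by metadata, and return the top-k payloads."""
    query_vector = model.encode(question).tolist()
    query_filter = None
    if tag:
        query_filter = Filter(must=[FieldCondition(key="tags", match=MatchValue(value=tag))])
    hits = client.search(
        collection_name="docs",
        query_vector=query_vector,
        query_filter=query_filter,
        limit=top_k,
    )
    return [{**hit.payload, "score": hit.score} for hit in hits]
```

If you later add a reranker, retrieve a larger candidate set (say 25), re-sort it, and keep only the top 5 for the prompt.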
9. Step 6: Grounded Answer Prompt + Citations
Prompt rules that materially reduce hallucinations:
- Use only provided context; if insufficient, say so.
- Cite sources next to claims (chunk IDs or titles/sections).
- Prefer quotes for definitions when accuracy matters.
- Ask clarifying questions when the query is ambiguous.
Answer format example (conceptual)
- Short answer
- Steps / details
- Citations: [doc_title §section] or [chunk_id]
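One way to encode those rules is a system prompt plus a context block that carries the chunk IDs, here calling a local model through the Ollama Python client (model name and wording are illustrative):

```python
import ollama

SYSTEM_PROMPT = (
    "Answer using ONLY the provided context. "
    "If the context is insufficient, say you don't know. "
    "Cite the chunk IDs you used in square brackets, e.g. [c12]."
)

def build_prompt(question: str, chunks: list[dict]) -> str:
    """Label each chunk with its ID and source so the model can cite them."""
    context = "\n\n".join(
        f"[{c['chunk_id']}] {c['doc_title']} §{c['section']}\n{c['chunk_text']}" for c in chunks
    )
    return f"Context:\n{context}\n\nQuestion: {question}"

def generate_answer(question: str, chunks: list[dict]) -> str:
    response = ollama.chat(
        model="llama3.1",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": build_prompt(question, chunks)},
        ],
    )
    return response["message"]["content"]
```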
10. Step 7: Follow-Ups and Conversation Memory
- Short-term memory: keep the last few user turns in the prompt.
- Question rewriting: convert follow-ups into standalone queries for retrieval.
- Keep retrieval fresh: retrieve again on every turn (don’t rely only on prior context).
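Question rewriting can itself be a small LLM call. A sketch that condenses the recent turns plus the follow-up into a standalone retrieval query (the prompt wording is an assumption, not a standard recipe):

```python
import ollama

def rewrite_query(history: list[dict], follow_up: str) -> str:
    """Turn a follow-up like 'what about pricing?' into a standalone search query."""
    transcript = "\n".join(f"{turn['role']}: {turn['content']}" for turn in history[-4:])
    response = ollama.chat(
        model="llama3.1",
        messages=[
            {
                "role": "system",
                "content": "Rewrite the user's last question as a standalone search query. "
                           "Return only the rewritten query.",
            },
            {"role": "user", "content": f"Conversation:\n{transcript}\n\nLast question: {follow_up}"},
        ],
    )
    return response["message"]["content"].strip()
```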
11. Step 8: Evaluate and Improve Quality
Evaluate both retrieval and generation:
- Retrieval: does the top-k include the answer-bearing chunk?
- Grounding: does the answer stick to provided evidence?
- Helpfulness: does it give actionable steps and handle ambiguity?
- Safety: does it avoid leaking secrets or making risky claims?
RAG evaluation trap
If you only evaluate the final answer, you won’t know whether failures come from retrieval or the prompt/model. Measure both.
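The retrieval half can be measured with a small labeled set of question/chunk pairs and a hit-rate@k metric; the eval-set shape below is an assumption:

```python
def retrieval_hit_rate(eval_set: list[dict], top_k: int = 5) -> float:
    """Fraction of questions whose answer-bearing chunk appears in the top-k results.

    Each eval_set item looks like: {"question": "...", "expected_chunk_id": "..."}.
    """
    hits = 0
    for item in eval_set:
        results = retrieve(item["question"], top_k=top_k)
        retrieved_ids = {r["chunk_id"] for r in results}
        if item["expected_chunk_id"] in retrieved_ids:
            hits += 1
    return hits / len(eval_set) if eval_set else 0.0
```

Track this number over time: if it drops after a re-ingest or a chunking change, the failure is in retrieval, not the prompt or model.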
12. Deployment Checklist (Local to Production)
- Index refresh: scheduled re-ingest for changed docs.
- Observability: log query, retrieved chunk IDs, latency, and errors.
- Access control: restrict sources by user/team where needed.
- PII handling: avoid indexing sensitive data without policy and controls.
- Fallback behavior: “I don’t know” + escalate path.
- Rate limiting: protect the service and the model runtime.
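A minimal FastAPI endpoint that ties the sketches together and logs the observability fields above (route and request shape are illustrative):

```python
import logging
import time

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
logger = logging.getLogger("rag")

class AskRequest(BaseModel):
    question: str
    tag: str | None = None

@app.post("/ask")
def ask(req: AskRequest) -> dict:
    started = time.perf_counter()
    chunks = retrieve(req.question, top_k=5, tag=req.tag)
    answer = generate_answer(req.question, chunks)
    latency_ms = (time.perf_counter() - started) * 1000
    # Log what you need to debug retrieval later: query, chunk IDs, latency.
    logger.info(
        "query=%r chunk_ids=%s latency_ms=%.0f",
        req.question,
        [c["chunk_id"] for c in chunks],
        latency_ms,
    )
    return {"answer": answer, "citations": [c["chunk_id"] for c in chunks]}
```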
13. FAQ: RAG Chatbots
Can I run RAG fully locally?
Yes. You can run a local LLM (via Ollama), local embeddings, and a local vector index. Performance depends on your hardware and corpus size.
Do I need reranking?
Not for every dataset, but reranking often improves precision—especially when many chunks are semantically similar.
How do I reduce hallucinations the most?
Improve retrieval quality and enforce “answer only from context” prompting, plus citations and an explicit “insufficient context” fallback.
How do I handle PDFs and messy docs?
Extract and normalize text carefully, remove repeated boilerplate, and chunk by headings where possible. Keep source metadata so you can debug retrieval quickly.
What should I store in metadata?
At minimum: doc title, section heading, source URL/path, updated time, doc_id, and chunk_id. Metadata enables filters and trustworthy citations.
Key RAG terms (quick glossary)
- RAG: Retrieval-Augmented Generation. Retrieve relevant passages, then generate an answer grounded in them.
- Embedding: a vector representation of text that captures semantic meaning for search.
- Vector database: a system optimized for storing vectors and running similarity search with metadata filters.
- Chunk: a smaller segment of a document (with metadata) used for retrieval and prompting.
- Top-k: the number of retrieved chunks returned by the vector search step.
- Reranker: a model that re-sorts retrieved candidates to improve relevance/precision.
- Grounding: constraining the model to answer using provided evidence (and cite it).
- Query rewriting: converting follow-up questions into standalone queries to improve retrieval.