Design a RAG-Based Chatbot System

System Design · ML System Design · Onsite · Phone · Machine Learning Engineer · Reported Jun, 2026 · Medium Frequency

Problem Statement

Design an intelligent chatbot system that uses Retrieval-Augmented Generation (RAG) to answer user queries. The system should be similar to enterprise AI assistants like Glean, which combine information retrieval with language model capabilities to provide contextually relevant responses.

Key Design Considerations

Beyond Simple RAG: This is more than a basic RAG implementation. It is a complete chatbot system with additional layers: permissions, citations, multi-turn context, evaluation, and latency management.

Enterprise Requirements: Consider multi-user access, permission management, and data privacy

Quality & Accuracy: Ensure responses are factual, relevant, and properly cite sources

Performance: Balance retrieval quality with response time

Common Follow-up Questions

During the interview, you may be asked to discuss:

Embedding Strategy: How to convert documents into embeddings and choose appropriate embedding models

Vector Database Selection: Compare different vector databases (Pinecone, Weaviate, Chroma, etc.) for storing and retrieving embeddings

Chunking Strategy: How to split documents into optimal chunks for retrieval

Retrieval Methods: Different approaches to finding relevant context (semantic search, hybrid search, reranking)

Prompt Engineering: How to construct effective prompts that incorporate retrieved context

Citation & Source Tracking: How to maintain and display source attribution

Cache & Performance: Strategies for caching frequently asked questions and optimizing retrieval speed

Evaluation Metrics: How to measure RAG system quality (relevance, accuracy, hallucination detection)

Multi-turn Conversations: Managing conversation context and history

Security & Privacy: Ensuring users only access authorized documents

Solution Resources

This problem has comprehensive architectural guides available online. We recommend reviewing these resources:

Medium - Designing High-Performing RAG Systems

Microsoft Azure - RAG Solution Design and Evaluation Guide

Galileo AI - Mastering RAG: Enterprise RAG Architecture

AWS - What is Retrieval-Augmented Generation?

Disclaimer: These resources provide sample architectural approaches and best practices. During your interview, you should develop and articulate your own solution based on your understanding of the requirements, trade-offs, and system design principles. Use these as learning references, not as answers to memorize.


Reference solution

#37 Design a RAG-Based Chatbot System (Glean-like) — Solution

✦ AI-Generated Solution · ML System Design · Comprehensive

An enterprise assistant that answers questions over private company documents using Retrieval-Augmented Generation, with permissions, citations, quality/eval, and latency as first-class concerns (this is more than a toy RAG).


1. Requirements

Functional

  • Answer NL questions grounded in enterprise documents (wikis, tickets, docs, code).
  • Cite sources for every answer.
  • Multi-turn conversation.
  • Respect per-user permissions: users see only what they are authorized to access.

Non-functional

  • Factual & low-hallucination; relevant.
  • Low latency (interactive); cost-aware.
  • Fresh — new/edited docs become answerable quickly.

2. Architecture (two pipelines)

[Figure: RAG architecture]

Separate the offline ingestion pipeline (docs → chunks → embeddings → vector DB) from the online query pipeline (query → retrieve → rerank → generate → cite). Permissions are enforced on the retrieval side.

3. Ingestion Pipeline

  • Connectors pull from sources (Drive, Confluence, Slack, GitHub…) with ACL metadata captured per document (who can see it).
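  • Chunking: split into semantically coherent chunks (~200–500 tokens) with overlap (~10–20%); prefer structure-aware splitting (headings, code blocks) over naive fixed windows. Chunk size is a recall/precision knob: small chunks match a query more precisely but can lose surrounding context, while large chunks keep context but dilute the embedding and use more of the prompt budget. Be ready to discuss this trade-off.
  • Embedding: encode each chunk with a general-purpose text-embedding model, fine-tuning on domain data if in-domain retrieval quality falls short. Store (vector, text, source_uri, acl, timestamp) per chunk.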
  • Vector store + metadata: index vectors with metadata filters for ACL and recency.
  • Freshness: incremental re-index via CDC/webhooks from sources; re-embed changed chunks only.
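
The overlap arithmetic in the chunking step can be sketched as follows. This is a minimal illustration, assuming whitespace-separated words stand in for tokens and using the naive fixed-window baseline; a structure-aware splitter would first cut at headings or code blocks and then apply the same windowing inside each section. The function name `chunk_words` and its defaults are hypothetical.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 300, overlap: int = 45) -> List[str]:
    """Split text into windows of `chunk_size` words; consecutive windows share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()  # assumption: one whitespace-separated word ~ one token
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, a 300-word window and a 45-word overlap give 15% overlap, inside the ~10–20% range mentioned above.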

4. Retrieval (quality lives here)

  • Hybrid retrieval: combine dense (semantic, vector) + sparse (BM25/keyword) — dense catches paraphrase, sparse catches exact terms/IDs/acronyms. Fuse scores (e.g. Reciprocal Rank Fusion).
  • Permission filtering: apply the user's ACL as a metadata filter during retrieval (pre-filter), so unauthorized chunks are never even candidates — never rely on the LLM to "not mention" them.
  • Reranking: pass the top ~50 fused candidates through a cross-encoder reranker and keep the 5–8 most relevant. After good chunking, this is usually the largest quality lever.
  • Query transforms: for multi-turn, rewrite the follow-up into a standalone query using conversation history before retrieving.
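
As a rough sketch of the score fusion mentioned in the hybrid-retrieval bullet, Reciprocal Rank Fusion can be written in a few lines. It assumes each retriever returns an ordered list of chunk ids; the constant `k = 60` is a commonly used default rather than something this design prescribes, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]   # hypothetical output of the vector search
sparse_hits = ["c", "a", "d"]  # hypothetical output of BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Because RRF uses only ranks, it avoids having to calibrate dense similarity scores against BM25 scores, which live on different scales.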

5. Vector Database Choice

| Option | Notes |
|---|---|
| pgvector | Easiest if already on Postgres; metadata filtering + SQL ACLs in one place; great default |
| Pinecone | Managed, scales, simple |
| Weaviate / Qdrant / Milvus | OSS, hybrid search, self-host control |

Recommend pgvector when ACL/metadata filtering and operational simplicity matter (the typical enterprise case), or a managed vector DB at very large scale. Justify the choice by filtering needs, scale, and operational burden.

6. Generation, Citations & Multi-turn

  • Prompt construction: system instructions ("answer only from context, say 'I don't know' otherwise") + retrieved chunks (each tagged with a source id) + conversation summary + user question. Manage the context budget (truncate/compress lowest-ranked chunks).
  • Grounded citations: ask the model to attach the source id(s) it used per claim; map ids back to source_uri. Optionally verify each cited span actually supports the claim.
  • Multi-turn: keep a rolling summary + last-N turns; rewrite follow-ups to standalone queries (Section 4).
  • Hallucination control: instruct abstention when context is weak; threshold on retrieval scores → "no good source found."
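
The prompt construction, source tagging, context budget, and score-threshold abstention described above could be combined roughly as follows. This is a sketch under stated assumptions: reranker scores are taken to lie in 0..1, word counts stand in for token counts, and `RankedChunk`, `build_prompt`, `min_score`, and `budget_words` are hypothetical names and defaults.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

SYSTEM_INSTRUCTIONS = (
    "Answer only from the context below and cite the source id, e.g. [S1], "
    "after each claim. If the context does not contain the answer, say \"I don't know\"."
)


@dataclass
class RankedChunk:
    text: str
    source_uri: str
    score: float  # reranker score; higher is more relevant (assumed 0..1 scale)


def build_prompt(question: str, chunks: List[RankedChunk], summary: str = "",
                 min_score: float = 0.3, budget_words: int = 1500
                 ) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (prompt, source-id -> source_uri), or None when the caller should abstain."""
    ranked = sorted((c for c in chunks if c.score >= min_score),
                    key=lambda c: c.score, reverse=True)
    context_lines: List[str] = []
    id_to_uri: Dict[str, str] = {}
    used = 0
    for chunk in ranked:
        n_words = len(chunk.text.split())  # assumption: words approximate tokens
        if used + n_words > budget_words:
            break  # budget spent: the remaining, lower-ranked chunks are dropped
        source_id = f"S{len(id_to_uri) + 1}"
        id_to_uri[source_id] = chunk.source_uri
        context_lines.append(f"[{source_id}] {chunk.text}")
        used += n_words
    if not context_lines:
        return None  # nothing cleared the threshold: reply "no good source found"
    prompt = "\n\n".join([
        SYSTEM_INSTRUCTIONS,
        f"Conversation summary: {summary or '(none)'}",
        "Context:\n" + "\n".join(context_lines),
        f"Question: {question}",
    ])
    return prompt, id_to_uri
```

The returned mapping is what lets the application turn a cited `[S1]` in the model's answer back into a `source_uri` for display.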

7. Caching & Performance

  • Semantic cache: embed the query; if a previous query is near-identical (cosine ≥ threshold), return the cached answer → big latency/cost win for FAQs.
  • Cache embeddings; precompute for hot docs; stream tokens to the UI for perceived latency.
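
A semantic cache along these lines might look like the sketch below. It assumes query embeddings are plain lists of floats and uses a linear scan, which is only reasonable for a small cache. One detail goes beyond the bullets above and is an assumption of this sketch: entries are partitioned by a hypothetical `scope` key (for example a hash of the user's permission set), so a cached answer is never served to someone who could not have retrieved its sources.

```python
import math
from collections import defaultdict
from typing import DefaultDict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        # scope -> list of (query embedding, cached answer)
        self._entries: DefaultDict[str, List[Tuple[List[float], str]]] = defaultdict(list)

    def put(self, scope: str, query_vec: Sequence[float], answer: str) -> None:
        self._entries[scope].append((list(query_vec), answer))

    def get(self, scope: str, query_vec: Sequence[float]) -> Optional[str]:
        """Return the answer of the most similar cached query in this scope, if similar enough."""
        best_score, best_answer = -1.0, None
        for vec, answer in self._entries.get(scope, []):
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```

The threshold is a tuning parameter: too low and users get answers to a different question, too high and the cache rarely hits. Cached answers also need invalidation when the underlying documents change, which this sketch does not cover.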

8. Evaluation (interviewers push on this)

  • Retrieval: recall@k, MRR, nDCG on a labeled query→doc set.
  • Generation: faithfulness/groundedness (does the answer follow from context?), answer relevance, citation correctness — automated via an LLM-judge plus a human-rated golden set (RAGAS-style).
  • Hallucination rate and abstention correctness; track regressions on every index/model change.
  • Online: thumbs up/down, click-through on citations, deflection rate.
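
The retrieval metrics named above can be computed per query as in this sketch, assuming binary relevance labels (a chunk is either relevant to the query or not). MRR is then the mean of `reciprocal_rank` over all labeled queries. The function names are illustrative.

```python
import math
from typing import List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """nDCG with binary gains: DCG of the ranking divided by DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def mean_reciprocal_rank(runs: List[List[str]], labels: List[Set[str]]) -> float:
    """Average reciprocal rank over a labeled query set."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in zip(runs, labels)) / len(runs)
```

Running these on every index or model change, against the same labeled set, is what makes regressions visible.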

9. Security & Privacy

  • ACL pre-filtering at retrieval (Section 4); per-tenant isolation of indexes.
  • PII handling, audit logs of what was retrieved for whom; encryption at rest/in transit.
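
To illustrate ACL pre-filtering, the sketch below drops unauthorized chunks before ranking, so they can never reach the reranker or the prompt. It is an in-memory stand-in: in a real deployment the same predicate would be pushed into the vector store's metadata filter. It assumes each chunk carries the set of principals (users or groups) allowed to read its source document; `IndexedChunk` and `authorized_candidates` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class IndexedChunk:
    chunk_id: str
    acl: FrozenSet[str]  # principals allowed to read the source document
    score: float         # similarity to the query, higher is better


def authorized_candidates(chunks: List[IndexedChunk],
                          user_principals: FrozenSet[str],
                          top_k: int = 50) -> List[IndexedChunk]:
    """Keep only chunks whose ACL shares a principal with the user, then take the top_k."""
    allowed = [c for c in chunks if c.acl & user_principals]
    return sorted(allowed, key=lambda c: c.score, reverse=True)[:top_k]


index = [
    IndexedChunk("c1", frozenset({"group:eng"}), 0.90),
    IndexedChunk("c2", frozenset({"group:hr"}), 0.95),  # best score, but not visible to this user
    IndexedChunk("c3", frozenset({"user:ana"}), 0.40),
]
visible = authorized_candidates(index, frozenset({"user:ana", "group:eng"}))
```

Filtering before the top-k cut matters: filtering afterwards can leave the user with few or no candidates even when authorized matches exist further down the ranking.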

10. Summary

| Concern | Decision |
|---|---|
| Pipelines | Offline ingest + online query, separated |
| Chunking | Structure-aware, ~200–500 tok, overlap |
| Retrieval | Hybrid (dense+sparse) + cross-encoder rerank |
| Permissions | ACL metadata pre-filter during retrieval |
| Vector DB | pgvector (filter/ops) or managed at scale |
| Grounding | "Answer from context only" + per-claim citations + abstention |
| Latency/cost | Semantic cache, streaming, embedding cache |
| Quality | Retrieval (recall@k/nDCG) + faithfulness/citation eval, golden set |