Design a RAG-Based Chatbot System

System Design · ML System Design · Onsite · Phone · Machine Learning Engineer · Reported Jun, 2026 · Medium Frequency

Problem Statement

Design an intelligent chatbot system that uses Retrieval-Augmented Generation (RAG) to answer user queries. The system should be similar to enterprise AI assistants like Glean, which combine information retrieval with language model capabilities to provide contextually relevant responses.

Key Design Considerations

Beyond Simple RAG: This is a complete chatbot system, not a bare retrieve-and-generate loop: it adds permissions, citations, multi-turn state, caching, and evaluation on top of basic RAG

Enterprise Requirements: Consider multi-user access, permission management, and data privacy

Quality & Accuracy: Ensure responses are factual, relevant, and properly cite sources

Performance: Balance retrieval quality with response time

Common Follow-up Questions

During the interview, you may be asked to discuss:

Embedding Strategy: How to convert documents into embeddings and choose appropriate embedding models

Vector Database Selection: Compare different vector databases (Pinecone, Weaviate, Chroma, etc.) for storing and retrieving embeddings

Chunking Strategy: How to split documents into optimal chunks for retrieval

Retrieval Methods: Different approaches to finding relevant context (semantic search, hybrid search, reranking)

Prompt Engineering: How to construct effective prompts that incorporate retrieved context

Citation & Source Tracking: How to maintain and display source attribution

Cache & Performance: Strategies for caching frequently asked questions and optimizing retrieval speed

Evaluation Metrics: How to measure RAG system quality (relevance, accuracy, hallucination detection)

Multi-turn Conversations: Managing conversation context and history

Security & Privacy: Ensuring users only access authorized documents

Solution Resources

This problem has comprehensive architectural guides available online. We recommend reviewing these resources:

Medium - Designing High-Performing RAG Systems

Microsoft Azure - RAG Solution Design and Evaluation Guide

Galileo AI - Mastering RAG: Enterprise RAG Architecture

AWS - What is Retrieval-Augmented Generation?

Disclaimer: These resources provide sample architectural approaches and best practices. During your interview, you should develop and articulate your own solution based on your understanding of the requirements, trade-offs, and system design principles. Use these as learning references, not as answers to memorize.

Reference solution

#37 Design a RAG-Based Chatbot System (Glean-like) — Solution

✦ AI-Generated Solution · ML System Design · Comprehensive

An enterprise assistant that answers questions over private company documents using Retrieval-Augmented Generation, with permissions, citations, quality evaluation, and latency as first-class concerns (this is more than a toy RAG).

1. Requirements

Functional

Answer natural-language questions grounded in enterprise documents (wikis, tickets, docs, code).
Cite sources for every answer.
Multi-turn conversation.
Respect per-user permissions — users only see what they're authorized to.

Non-functional

Factual and relevant, with a low hallucination rate.
Low latency (interactive) and cost-aware.
Fresh: new or edited docs become answerable quickly.

2. Architecture (two pipelines)

[Figure: RAG architecture]

Separate the offline ingestion pipeline (docs → chunks → embeddings → vector DB) from the online query pipeline (query → retrieve → rerank → generate → cite). Permissions are enforced on the retrieval side.

3. Ingestion Pipeline

Connectors pull from sources (Drive, Confluence, Slack, GitHub…) with ACL metadata captured per document (who can see it).
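Chunking: split into semantically coherent chunks (~200–500 tokens) with overlap (~10–20%); prefer structure-aware splitting (headings, code blocks) over naive fixed windows. Chunk size is a recall/precision knob: smaller chunks match a query more precisely but can lose surrounding context, while larger chunks keep context but dilute the embedding and use more of the prompt budget.
Embedding: encode each chunk with a general-purpose text-embedding model, fine-tuned on domain data if retrieval quality on in-domain queries is poor. Store one record per chunk: (vector, text, source_uri, acl, timestamp).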
Vector store + metadata: index vectors with metadata filters for ACL and recency.
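Freshness: incremental re-index via change data capture (CDC) or webhooks from sources; re-embed changed chunks only. ACL changes and deletions must propagate too, or retrieval will serve stale permissions.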
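
The chunking step above can be sketched as follows. This is a minimal illustration, not a production splitter: it treats blank-line-separated paragraphs as the structural unit, counts whitespace-separated words as a stand-in for model tokens, and the function name `chunk_text` and its defaults (300 tokens, 45-token overlap, roughly 15%) are assumptions chosen to fall inside the ranges quoted above.

```python
from typing import List


def chunk_text(text: str, max_tokens: int = 300, overlap: int = 45) -> List[str]:
    """Pack paragraphs into chunks of at most `max_tokens` words.

    Each chunk after the first starts with the last `overlap` words of the
    previous chunk. Words are a rough proxy for model tokens (assumption).
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    # Split oversized paragraphs so that overlap + unit always fits in a chunk.
    body = max_tokens - overlap
    units: List[List[str]] = []
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        for start in range(0, len(words), body):
            units.append(words[start:start + body])

    chunks: List[str] = []
    current: List[str] = []
    has_new = False  # True once `current` holds more than carried-over overlap
    for unit in units:
        if has_new and len(current) + len(unit) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
            has_new = False
        current.extend(unit)
        has_new = True
    if has_new:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("a b c\n\nd e f\n\ng h i", max_tokens=6, overlap=2):
        print(chunk)
```

A real pipeline would likely split on headings and code fences as well and count tokens with the embedding model's own tokenizer; the packing and overlap logic stays the same.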

4. Retrieval (quality lives here)

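Hybrid retrieval: combine dense (semantic, vector) and sparse (BM25/keyword) search. Dense catches paraphrase; sparse catches exact terms, IDs, and acronyms. Fuse the two ranked lists, e.g. with Reciprocal Rank Fusion, which combines ranks rather than raw scores, so dense and sparse scores need no calibration against each other.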
Permission filtering: apply the user's ACL as a metadata filter during retrieval (pre-filter), so unauthorized chunks are never even candidates — never rely on the LLM to "not mention" them.
Reranking: pass the top ~50 candidates through a cross-encoder reranker and keep the 5–8 most relevant. This is often the biggest quality lever after good chunking, at the cost of extra latency per query.
Query transforms: for multi-turn, rewrite the follow-up into a standalone query using conversation history before retrieving.
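
The fusion and reranking steps above can be sketched as follows. Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over the lists it appears in; k = 60 is a commonly used default, not a value from this solution. The `score_fn` argument is a hypothetical stand-in for a cross-encoder: any callable that maps (query, chunk_id) to a relevance score.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of ids into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def rerank(query: str, chunk_ids: Sequence[str],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> List[str]:
    """Re-order candidates with a (hypothetical) pairwise relevance scorer."""
    ordered = sorted(chunk_ids, key=lambda cid: score_fn(query, cid), reverse=True)
    return ordered[:top_n]


if __name__ == "__main__":
    dense = ["a", "b", "c"]   # ids ranked by vector similarity
    sparse = ["b", "d", "a"]  # ids ranked by BM25
    fused = reciprocal_rank_fusion([dense, sparse])
    print(fused)

    toy_scores = {"a": 0.9, "b": 0.2, "c": 0.5, "d": 0.1}
    print(rerank("query", fused[:50], lambda q, cid: toy_scores[cid], top_n=2))
```

Documents that appear high in both lists rise to the top, which is why "b" outranks "a" here even though "a" is first in the dense list.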

5. Vector Database Choice

Option	Notes
pgvector	Easiest if already on Postgres; metadata filtering + SQL ACLs in one place; great default
Pinecone	Managed, scales, simple
Weaviate / Qdrant / Milvus	OSS, hybrid search, self-host control

Recommend pgvector when ACL/metadata filtering and operational simplicity matter (enterprise), or a managed vector DB at very large scale. Justify the choice by filtering needs, scale, and operational burden.

6. Generation, Citations & Multi-turn

Prompt construction: system instructions ("answer only from context, say 'I don't know' otherwise") + retrieved chunks (each tagged with a source id) + conversation summary + user question. Manage the context budget (truncate/compress lowest-ranked chunks).
Grounded citations: ask the model to attach the source id(s) it used per claim; map ids back to source_uri. Optionally verify each cited span actually supports the claim.
Multi-turn: keep a rolling summary + last-N turns; rewrite follow-ups to standalone queries (Section 4).
Hallucination control: instruct abstention when context is weak; if the best retrieval or reranker score falls below a threshold, return "no good source found" instead of generating.
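
Prompt assembly, citation mapping, and score-based abstention can be sketched together. Everything named here is an assumption for illustration: the `Chunk` type, the `S1`-style source ids, the 0.3 score threshold, the word-count budget, and the example URIs. The sketch drops the lowest-ranked chunks once the budget is spent rather than compressing them.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

SYSTEM = (
    "Answer only from the numbered sources below. Cite the source id in "
    "square brackets after each claim, e.g. [S1]. If the sources do not "
    "contain the answer, say \"I don't know\"."
)


@dataclass
class Chunk:
    text: str
    source_uri: str
    score: float  # retrieval or reranker score, higher is better


def build_prompt(question: str, chunks: List[Chunk], summary: str = "",
                 budget_tokens: int = 1500, min_score: float = 0.3
                 ) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (prompt, id -> source_uri), or None to signal abstention."""
    ranked = sorted((c for c in chunks if c.score >= min_score),
                    key=lambda c: c.score, reverse=True)
    sources: List[str] = []
    id_to_uri: Dict[str, str] = {}
    used = 0
    for chunk in ranked:
        cost = len(chunk.text.split())  # words as a rough token proxy
        if used + cost > budget_tokens:
            break  # budget spent: lower-ranked chunks are dropped
        source_id = f"S{len(sources) + 1}"
        sources.append(f"[{source_id}] {chunk.text}")
        id_to_uri[source_id] = chunk.source_uri
        used += cost
    if not sources:
        return None  # caller should answer "no good source found"
    parts = [SYSTEM, "Sources:\n" + "\n".join(sources)]
    if summary:
        parts.append("Conversation summary:\n" + summary)
    parts.append("Question: " + question)
    return "\n\n".join(parts), id_to_uri


def resolve_citations(answer: str, id_to_uri: Dict[str, str]) -> List[str]:
    """Map [S1]-style ids in the model's answer back to source URIs."""
    uris: List[str] = []
    for source_id in re.findall(r"\[(S\d+)\]", answer):
        uri = id_to_uri.get(source_id)
        if uri and uri not in uris:
            uris.append(uri)
    return uris
```

Ids the model invents (for example `[S9]` when only `S1` exists) resolve to nothing, which gives a cheap first check on citation correctness before any span-level verification.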

7. Caching & Performance

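Semantic cache: embed the query; if a previous query is near-identical (cosine ≥ threshold), return the cached answer. This is a big latency and cost win for FAQs. Scope cache entries by permission set and invalidate them when source documents change; otherwise a cached answer can leak restricted content or go stale.
Cache embeddings; precompute for hot docs; stream tokens to the UI to reduce perceived latency.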
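
A semantic cache can be sketched as a scoped list of (query vector, answer) pairs with a linear cosine scan. The `scope` key (for example a hash of the user's group memberships), the 0.95 threshold, and the class name are assumptions; a real deployment would more likely reuse the vector store for the lookup and add eviction and invalidation.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Cache answers keyed by query embedding, partitioned by permission scope."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self._entries: Dict[str, List[Tuple[List[float], str]]] = defaultdict(list)

    def get(self, scope: str, query_vec: Sequence[float]) -> Optional[str]:
        best_score, best_answer = 0.0, None
        for vec, answer in self._entries.get(scope, []):
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, scope: str, query_vec: Sequence[float], answer: str) -> None:
        self._entries[scope].append((list(query_vec), answer))


if __name__ == "__main__":
    cache = SemanticCache()
    cache.put("eng", [1.0, 0.0], "answer A")
    print(cache.get("eng", [0.99, 0.05]))   # near-identical query: hit
    print(cache.get("sales", [1.0, 0.0]))   # different scope: miss
```

Partitioning by scope means two users only share a cached answer when they have the same access, which keeps the cache from bypassing the ACL filter applied at retrieval.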

8. Evaluation (interviewers push on this)

Retrieval: recall@k, MRR, nDCG on a labeled query→doc set.
Generation: faithfulness/groundedness (does the answer follow from context?), answer relevance, citation correctness — automated via an LLM-judge plus a human-rated golden set (RAGAS-style).
Hallucination rate and abstention correctness; track regressions on every index/model change.
Online: thumbs up/down, click-through on citations, deflection rate.
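
The retrieval metrics above are simple to compute per query once a labeled set of relevant document ids exists. The sketch below assumes binary relevance and unique ids in the retrieved list; MRR is the mean of `reciprocal_rank` over all queries. Function names are illustrative.

```python
import math
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """nDCG with binary gains: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


if __name__ == "__main__":
    retrieved = ["d3", "d1", "d7", "d2"]
    relevant = {"d1", "d2"}
    print(recall_at_k(retrieved, relevant, k=3))
    print(reciprocal_rank(retrieved, relevant))
    print(round(ndcg_at_k(retrieved, relevant, k=4), 4))
```

Running these on a fixed golden set after every index or model change is one way to catch the regressions mentioned above.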

9. Security & Privacy

ACL pre-filtering at retrieval (Section 4); per-tenant isolation of indexes.
PII handling, audit logs of what was retrieved for whom; encryption at rest/in transit.
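
ACL pre-filtering with tenant isolation can be sketched in memory as below. In practice the same predicate would be pushed into the vector store as a metadata filter; the `IndexedChunk` type, the group-based ACL model, and the dot-product scoring are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple


@dataclass(frozen=True)
class IndexedChunk:
    chunk_id: str
    tenant_id: str
    allowed_groups: FrozenSet[str]  # ACL captured at ingestion time
    vector: Tuple[float, ...]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def search(chunks: Sequence[IndexedChunk], query_vec: Sequence[float],
           tenant_id: str, user_groups: FrozenSet[str], top_k: int = 50) -> List[str]:
    """Filter by tenant and ACL first, then rank only the authorized chunks."""
    candidates = [
        c for c in chunks
        if c.tenant_id == tenant_id and c.allowed_groups & user_groups
    ]
    ranked = sorted(candidates, key=lambda c: dot(query_vec, c.vector), reverse=True)
    return [c.chunk_id for c in ranked[:top_k]]


if __name__ == "__main__":
    index = [
        IndexedChunk("c1", "acme", frozenset({"eng"}), (1.0, 0.0)),
        IndexedChunk("c2", "acme", frozenset({"hr"}), (1.0, 0.0)),
        IndexedChunk("c3", "other", frozenset({"eng"}), (1.0, 0.0)),
        IndexedChunk("c4", "acme", frozenset({"eng", "hr"}), (0.5, 0.5)),
    ]
    print(search(index, (1.0, 0.0), "acme", frozenset({"eng"})))
```

Because filtering happens before ranking, an unauthorized chunk never reaches the reranker or the prompt, so nothing downstream has to be trusted to withhold it.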

10. Summary

Concern	Decision
Pipelines	Offline ingest + online query, separated
Chunking	Structure-aware, ~200–500 tok, overlap
Retrieval	Hybrid (dense+sparse) + cross-encoder rerank
Permissions	ACL metadata pre-filter during retrieval
Vector DB	pgvector (filter/ops) or managed at scale
Grounding	"Answer from context only" + per-claim citations + abstention
Latency/cost	Semantic cache, streaming, embedding cache
Quality	Retrieval (recall@k/nDCG) + faithfulness/citation eval, golden set
