Design an intelligent chatbot system that uses Retrieval-Augmented Generation (RAG) to answer user queries. The system should be similar to enterprise AI assistants like Glean, which combine information retrieval with language model capabilities to provide contextually relevant responses.
Beyond Simple RAG: This is not just a basic RAG implementation; it is a complete chatbot system with additional layers of complexity.
Enterprise Requirements: Consider multi-user access, permission management, and data privacy
Quality & Accuracy: Ensure responses are factual, relevant, and properly cite sources
Performance: Balance retrieval quality with response time
During the interview, you may be asked to discuss:
Embedding Strategy: How to convert documents into embeddings and choose appropriate embedding models
Vector Database Selection: Compare different vector databases (Pinecone, Weaviate, Chroma, etc.) for storing and retrieving embeddings
Chunking Strategy: How to split documents into optimal chunks for retrieval
Retrieval Methods: Different approaches to finding relevant context (semantic search, hybrid search, reranking)
Prompt Engineering: How to construct effective prompts that incorporate retrieved context
Citation & Source Tracking: How to maintain and display source attribution
Cache & Performance: Strategies for caching frequently asked questions and optimizing retrieval speed
Evaluation Metrics: How to measure RAG system quality (relevance, accuracy, hallucination detection)
Multi-turn Conversations: Managing conversation context and history
Security & Privacy: Ensuring users only access authorized documents
This problem has comprehensive architectural guides available online. We recommend reviewing these resources:
Medium - Designing High-Performing RAG Systems
Microsoft Azure - RAG Solution Design and Evaluation Guide
Galileo AI - Mastering RAG: Enterprise RAG Architecture
AWS - What is Retrieval-Augmented Generation?
Disclaimer: These resources provide sample architectural approaches and best practices. During your interview, you should develop and articulate your own solution based on your understanding of the requirements, trade-offs, and system design principles. Use these as learning references, not as answers to memorize.
✦ AI-Generated Solution · ML System Design · Comprehensive
An enterprise assistant that answers questions over private company documents using Retrieval-Augmented Generation, with permissions, citations, quality/eval, and latency as first-class concerns (this is more than a toy RAG).
Functional
Non-functional

Separate the offline ingestion pipeline (docs → chunks → embeddings → vector DB) from the online query pipeline (query → retrieve → rerank → generate → cite). Permissions are enforced on the retrieval side.
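As a rough illustration of the split described above, the sketch below keeps ingestion and retrieval as two separate functions that share only a store. Everything in it is an assumption made for illustration: the `Chunk` record, the `toy_embed` hashing embedder (a stand-in for a real embedding model), the word-window chunker, the in-memory list used as a "vector DB", and the group-based ACL check are all hypothetical. Reranking, generation, and citation are left out.

```python
import hashlib
import math
import re
from dataclasses import dataclass, field
from time import time

DIM = 64  # toy embedding size (assumption)

def toy_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for word in re.findall(r"\w+", text.lower()):
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

@dataclass
class Chunk:
    vector: list[float]
    text: str
    source_uri: str
    acl: frozenset[str]  # groups allowed to read the source document
    timestamp: float = field(default_factory=time)

def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Fixed-size word windows with overlap (words, not tokens, for simplicity)."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]

def ingest(store: list[Chunk], text: str, source_uri: str, acl: set[str]) -> None:
    """Offline pipeline: document -> chunks -> embeddings -> store."""
    for piece in chunk_words(text):
        store.append(Chunk(toy_embed(piece), piece, source_uri, frozenset(acl)))

def retrieve(store: list[Chunk], query: str, user_groups: set[str], k: int = 3) -> list[Chunk]:
    """Online pipeline, retrieval step: ACL pre-filter first, then cosine top-k."""
    q = toy_embed(query)
    allowed = [c for c in store if c.acl & user_groups]
    allowed.sort(key=lambda c: sum(a * b for a, b in zip(q, c.vector)), reverse=True)
    return allowed[:k]

# Usage: two documents with different ACLs, queried by an "all-staff" user.
store: list[Chunk] = []
ingest(store, "Quarterly revenue grew in the enterprise segment.", "wiki://finance/q3", {"finance"})
ingest(store, "The vacation policy allows carry-over of five days.", "wiki://hr/vacation", {"all-staff"})
hits = retrieve(store, "vacation policy", {"all-staff"})
print([c.source_uri for c in hits])
```

Filtering before scoring means an unauthorized chunk can never reach the prompt, which is the property the permissions requirement asks for. A production system would push the same filter into the vector store's query rather than scanning a list.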
(vector, text, source_uri, acl, timestamp).

| Option | Notes |
|---|---|
| pgvector | Easiest if already on Postgres; metadata filtering + SQL ACLs in one place; great default |
| Pinecone | Managed, scales, simple |
| Weaviate / Qdrant / Milvus | OSS, hybrid search, self-host control |
Recommend pgvector when ACL/metadata filtering and operational simplicity matter (the typical enterprise case), or a managed vector DB at very large scale. Justify the choice by filtering needs, scale, and operational overhead.
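To make the filtering argument concrete, here is a minimal sketch of an ACL-filtered similarity query of the kind pgvector allows. The table name `chunks`, its columns, and the choice of a text-array `acl` column are hypothetical. `<=>` is pgvector's cosine-distance operator and `&&` is PostgreSQL's array-overlap operator. The function only builds the parameterized SQL and its arguments. Running it is left out, and it assumes a driver such as psycopg that adapts a Python list to a PostgreSQL array.

```python
def build_acl_search(query_vec: list[float], user_groups: list[str], k: int = 5):
    """Return (sql, params) for a top-k cosine search restricted by ACL overlap."""
    sql = (
        "SELECT text, source_uri, embedding <=> %s::vector AS distance "
        "FROM chunks "                 # hypothetical table name
        "WHERE acl && %s::text[] "     # keep rows sharing at least one group with the user
        "ORDER BY distance "
        "LIMIT %s"
    )
    # pgvector accepts vectors in the text form '[x1,x2,...]'
    vec_literal = "[" + ",".join(repr(float(x)) for x in query_vec) + "]"
    return sql, (vec_literal, user_groups, k)

sql, params = build_acl_search([0.1, 0.2], ["finance"], k=3)
print(sql)
print(params)
```

Keeping the ACL predicate and the distance ordering in one statement is what "metadata filtering + SQL ACLs in one place" buys. How well the filter combines with an approximate index depends on filter selectivity and should be measured.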
source_uri. Optionally verify each cited span actually supports the claim.

| Concern | Decision |
|---|---|
| Pipelines | Offline ingest + online query, separated |
| Chunking | Structure-aware, ~200–500 tokens, with overlap |
| Retrieval | Hybrid (dense+sparse) + cross-encoder rerank |
| Permissions | ACL metadata pre-filter during retrieval |
| Vector DB | pgvector (filter/ops) or managed at scale |
| Grounding | "Answer from context only" + per-claim citations + abstention |
| Latency/cost | Semantic cache, streaming, embedding cache |
| Quality | Retrieval (recall@k/nDCG) + faithfulness/citation eval, golden set |
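The retrieval metrics named in the Quality row can be computed against a golden set along the following lines. This is a minimal sketch under stated assumptions: document ids are unique within a result list, relevance labels come from a hand-built golden set, and nDCG uses the common linear-gain form with a log2 discount. The function names are illustrative.

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def ndcg_at_k(retrieved: list[str], gains: dict[str, float], k: int) -> float:
    """DCG of the top-k results divided by the DCG of the ideal ordering."""
    dcg = sum(
        gains.get(doc, 0.0) / math.log2(rank + 2)  # rank is 0-based, so discount is log2(rank + 2)
        for rank, doc in enumerate(retrieved[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# One golden-set entry: the query has three relevant documents, the retriever found two.
print(recall_at_k(["a", "x", "b"], {"a", "b", "c"}, k=3))
print(ndcg_at_k(["x", "a"], {"a": 1.0}, k=2))
```

Recall@k shows whether the right chunks reach the reranker at all, while nDCG also rewards placing them near the top. Faithfulness and citation checks on the generated answer need a separate evaluation.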