Pin down the contract
Enterprise RAG is not "chatbot over docs." Clarify:
- Scope: which document corpora (HR policies, engineering wikis, customer contracts, sales decks)? Each has different freshness, sensitivity, and ownership.
- Access control: which user can see which document? Access is part of the product, not an implementation detail. Permission leakage is the headline failure.
- Acceptable error mode: is "I don't know" preferred over a guess? In enterprise, almost always yes.
- Cost & latency budget: per-query cost ceiling, p95 latency, max prompt token size.
State explicitly: the assistant must answer only from retrieved documents the user has permission to read, and must cite every claim back to a source.
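One way to keep that contract visible is to encode it as a reviewable config object next to the prompt. A minimal sketch in Python; every field name and value here is illustrative, not a standard.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RagContract:
    """Illustrative contract for an enterprise RAG assistant; all values are examples."""
    corpora: tuple[str, ...] = ("hr-policies", "eng-wiki", "customer-contracts")
    enforce_acl_at_retrieval: bool = True   # never rely on filtering after generation
    prefer_refusal_over_guess: bool = True  # "I don't know" beats a confident guess
    require_citation_per_claim: bool = True
    max_cost_usd_per_query: float = 0.05    # example ceiling, set per deployment
    p95_latency_ms: int = 3000
    max_prompt_tokens: int = 8000
```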
High-level architecture
A defensible pipeline has six layers:
- Document ingestion — connectors per source system, content extraction (PDF, DOCX, Slack, Confluence), normalization, chunking.
- Indexing — embedding generation, sparse (BM25) and dense (ANN) indexes, metadata tags (owner, sensitivity, recency), per-document ACL snapshot.
- Retrieval — hybrid search (dense + sparse), filter by user's ACL at query time, return top-K candidates with metadata.
- Reranking — a cross-encoder or LLM-based reranker compresses K candidates into the top 5-10 most relevant.
- Generation — prompt with system instructions ("answer only from the context below; cite each claim"), retrieved chunks, user query.
- Logging & evaluation — store query, retrieved chunks, generated answer, citations, user feedback for offline eval and prompt iteration.
The architect signal is naming where access control is enforced: at retrieval time, not after generation. Filtering after generation is too late, because restricted content has already entered the LLM's context and can surface in the answer.
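A minimal sketch of that retrieval-time enforcement, assuming a hybrid index and an ACL snapshot stored per chunk; the `index` and `embedder` objects are placeholders for whatever search stack you run, not a specific vendor API.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    chunk_id: str
    text: str
    allowed_groups: frozenset[str]  # ACL snapshot captured at indexing time
    score: float = 0.0

def retrieve(query: str, user_groups: set[str], index, embedder, k: int = 50) -> list[Chunk]:
    """Hybrid retrieval with the permission filter applied before anything reaches the prompt."""
    dense_hits = index.dense_search(embedder.embed(query), k=k)  # placeholder interfaces
    sparse_hits = index.bm25_search(query, k=k)
    candidates = {c.chunk_id: c for c in (*dense_hits, *sparse_hits)}.values()  # dedupe by chunk
    # Hard ACL filter at query time: drop anything the user cannot read
    # so it can never enter the LLM's context.
    return [c for c in candidates if c.allowed_groups & user_groups]
```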
Chunking and embedding choices
Chunk size matters more than people admit:
- Too small (200 tokens) → context fragments, misses cross-paragraph reasoning.
- Too large (2000 tokens) → noisy retrieval, expensive prompts, dilutes the relevant section.
- Sweet spot for most corpora: 500-800 tokens with a 10-15% overlap, plus a parent-document reference for "expand context if needed."
Consider hierarchical chunking — child chunks for retrieval recall, parent sections for generation. And split on semantic boundaries (headings, sections), not just character counts.
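A rough sketch of heading-aware chunking with overlap and a parent reference; token counts are approximated by whitespace words here, where a real pipeline would use the model's tokenizer.

```python
import re
from dataclasses import dataclass

@dataclass
class SectionChunk:
    parent_id: str  # section heading, used to expand context at generation time
    text: str

def chunk_section(heading: str, body: str, target: int = 650, overlap: int = 80) -> list[SectionChunk]:
    """Split one section into ~650-token chunks with ~12% overlap (words as a token proxy)."""
    words = body.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + target, len(words))
        chunks.append(SectionChunk(parent_id=heading, text=" ".join(words[start:end])))
        if end == len(words):
            break
        start = end - overlap  # overlap keeps cross-boundary sentences retrievable
    return chunks

def chunk_document(doc: str) -> list[SectionChunk]:
    """Split on markdown-style headings first, then size-chunk each section."""
    out = []
    for section in re.split(r"\n(?=#{1,3} )", doc):
        if not section.strip():
            continue
        heading = section.splitlines()[0]
        out.extend(chunk_section(heading, section))
    return out
```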
Hybrid retrieval — and why
Pure dense retrieval misses exact-match queries (product names, error codes, ticket IDs). Pure BM25 misses paraphrases. Always combine:
- Dense: top-K by embedding similarity.
- Sparse: top-K by BM25.
- Merge: reciprocal rank fusion or a learned reranker over the union.
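Reciprocal rank fusion is simple enough to sketch inline; the inputs are the two ranked lists of chunk IDs, and 60 is the constant from the original RRF paper.

```python
def reciprocal_rank_fusion(dense_ids: list[str], sparse_ids: list[str], k: int = 60) -> list[str]:
    """Merge two ranked ID lists: each chunk scores the sum of 1 / (k + rank) across the lists."""
    scores: dict[str, float] = {}
    for ranked in (dense_ids, sparse_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Chunks ranked highly by either list surface near the top; hits in both lists get boosted.
merged = reciprocal_rank_fusion(["c3", "c7", "c1"], ["c9", "c3", "c2"])
```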
Mention re-embedding strategy: when a document changes, re-embed only the affected chunks. A model swap, by contrast, forces a bulk re-embed of the whole corpus, and that job is the operational gotcha that catches teams late.
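One way to keep re-embedding incremental is to key each chunk by a content hash and embed only hashes you have not stored yet. A sketch, where `embed` stands in for whatever embedding call you use.

```python
import hashlib

def upsert_changed_chunks(chunks: list[str], stored_hashes: set[str], embed) -> dict[str, list[float]]:
    """Embed only chunks whose content hash is new; unchanged chunks keep their existing vectors."""
    new_vectors: dict[str, list[float]] = {}
    for text in chunks:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in stored_hashes:
            new_vectors[digest] = embed(text)  # assumed embedding function
            stored_hashes.add(digest)
    return new_vectors
```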
Citations and grounding
Every claim must trace back. Practical approaches:
- Inline citations — model emits `[source: doc-id, chunk-id]` after each claim. Validate at parse time.
- Span-level highlighting — UI shows the exact text passage that supports the claim.
- Refuse on ungrounded — if confidence in a citation is low, the model says "I couldn't find this in the available docs."
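A sketch of parse-time validation for the inline format above: any citation that does not resolve to a chunk in the retrieved set is treated as ungrounded, and the answer is refused or regenerated.

```python
import re

CITATION = re.compile(r"\[source:\s*(?P<doc>[\w.-]+),\s*(?P<chunk>[\w.-]+)\]")

def citation_problems(answer: str, retrieved: set[tuple[str, str]]) -> list[str]:
    """Return a list of problems; an empty list means every claim cites a retrieved chunk."""
    cited = [(m["doc"], m["chunk"]) for m in CITATION.finditer(answer)]
    problems = [] if cited else ["no citations emitted"]
    for ref in cited:
        if ref not in retrieved:
            problems.append(f"hallucinated citation: {ref}")
    return problems
```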
Evaluation
Evaluate retrieval and generation separately:
- Retrieval: recall@K on a labeled (query, relevant-doc) set, MRR, NDCG.
- Generation: faithfulness (claims grounded in retrieved chunks), answer relevance, citation accuracy, refusal correctness.
- Operational: permission leakage rate (must be near zero), latency by query class (factual / synthesis / search), cost per resolved query.
Use an LLM-as-judge for scaled scoring, but check evaluator agreement against human raters on a sampled subset.
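The retrieval metrics are a few lines each; this sketch assumes the labeled set stores, per query, the ranked result IDs and the set of relevant doc IDs.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant docs that appear in the top-k results for one query."""
    return len(set(ranked[:k]) & relevant) / len(relevant) if relevant else 0.0

def mrr(ranked: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant hit, 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Averaged over the labeled set:
# mean_recall = sum(recall_at_k(r, rel, 10) for r, rel in eval_set) / len(eval_set)
```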
Failure modes worth naming
- Permission leakage — chunk from a restricted doc reaches a user who shouldn't see it. Caused by stale ACL snapshot or post-hoc filtering. Test with red-team queries.
- Stale answers — embedding index lags the source of truth. Solve with event-driven re-indexing, not nightly batch.
- Hallucinated citations — model makes up source IDs. Validate citations exist in the retrieved set; reject if not.
- Cost runaway — long context windows compound. Cap retrieved tokens, use a smaller model for synthesis when retrieval is highly confident.
- Cross-tenant contamination in multi-tenant deployments — fix with per-tenant indexes or hard ACL filters at retrieval.
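For the cost-runaway item above, one simple guard is to pack chunks in rank order until a token budget is hit. A sketch, again using word counts as a rough token proxy.

```python
def pack_context(ranked_chunks: list[str], max_tokens: int = 4000) -> list[str]:
    """Keep top-ranked chunks until the approximate token budget is exhausted."""
    packed, used = [], 0
    for text in ranked_chunks:
        cost = len(text.split())  # swap in a real tokenizer for production budgets
        if used + cost > max_tokens:
            break
        packed.append(text)
        used += cost
    return packed
```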
Rollout strategy
- Internal pilot — small team, narrow corpus, heavy logging, weekly eval review.
- Read-only Q&A — broaden corpus, no actions, refuse-rate goal under 20%.
- Citation-first UX — every answer comes with at least one verifiable citation.
- Department rollout — opt-in by team, dashboard for that team's accuracy / refusal / latency.
- Org-wide — only after permission-leakage rate stays at zero across the pilot.
What the architect signal looks like
Close with the operational view: which two metrics you would put on the team's wall (faithfulness rate and permission-leakage rate), and the one trade-off you would defend hardest under pressure (typically refusal-rate-vs-coverage, with the rationale that an assistant that refuses honestly is more valuable than one that confidently hallucinates).