RAG in Production: What Breaks at Enterprise Scale

By Aakash Ahuja·June 14, 2026·29 min read

RAG in production does not usually fail because the language model cannot write a good answer. It fails because the retrieval layer is allowed to behave like a demo search box inside an enterprise.

Retrieval-Augmented Generation, or RAG, is the pattern where an LLM retrieves external knowledge before generating an answer. The original RAG work combined a parametric model with non-parametric memory retrieved from a dense index, and modern cloud platforms describe RAG as a way to ground model output in external data sources rather than relying only on model training data.

That definition is accurate, but incomplete for production.

At enterprise scale, RAG is not just "LLM + vector database." It becomes a governed retrieval system that must decide which data is eligible, current, authorized, precise, traceable, and safe enough to enter the model context.

The hard production question is not:

Can we retrieve similar chunks?

The hard question is:

Can we retrieve the right evidence, from the right version, for the right user, under the right access boundary, fast enough, with enough traceability to defend the answer later?

That is where most RAG systems start breaking.

Why RAG prototypes work before they fail
The real setup: from SQLite + FAISS to enterprise constraints
What breaks first: retrieval quality
Why vector search alone is not enough
Why stale data becomes a production risk
Why access boundaries must be enforced before retrieval
Why OCR, tables, and PDFs quietly damage RAG quality
What architecture works for enterprise RAG
How to evaluate production RAG
What worked, what did not, and what to do next
Production RAG checklist
FAQ
Key takeaways

---

Why do RAG prototypes work before they fail?

A RAG prototype usually works because the environment is forgiving.

The dataset is small. The documents are mostly clean. The users are trusted. The question set is narrow. The data does not change much. Nobody is checking whether the answer came from the latest version of the policy. Nobody is asking whether a sales user accidentally retrieved a finance-only document.

That is why the first demo feels magical.

You upload files, chunk them, create embeddings, store them in FAISS or a vector database, ask a question, retrieve top-k chunks, pass them to the LLM, and get a grounded answer.

For a demo, this is enough.

For an enterprise, this is the beginning of the problem.

A production RAG system has to survive conditions that the prototype avoided:

Prototype assumption	Enterprise reality
All users can see all documents	Users have roles, teams, regions, tenants, and exceptions
Documents are static	Policies, contracts, SOPs, catalogs, tickets, and reports change
Text extraction is clean	PDFs, scans, tables, merged cells, images, and forms break structure
Top-k similarity is enough	Similarity does not prove authority, freshness, or applicability
One index is manageable	Multi-tenant systems need isolation, filtering, versioning, and audit trails
Good answer means good system	The system must prove which sources were used and why
Latency does not matter much	Production users expect seconds, not minutes

The first serious realization is this:

RAG is easy when all data is equally trusted. Enterprise RAG starts when data is not equally visible, current, or authoritative.

Production RAG reference architecture showing source connectors, ingestion queue, extraction pipeline, metadata store, vector and full-text indexes, permission-aware retrieval, answer generation, and an audit and governance layer.

What was the original setup, and why was it not enough?

One practical journey started with a custom RAG system built without LangChain.

The early stack was intentionally simple:

SQLite for chunks, entities, relationships, facts, and events.
FAISS for vector search.
OpenAI for embeddings and generation.
A custom retrieval pipeline.
File ingestion with OCR and table extraction.
No Kafka.
AWS as the likely production environment.
India-only storage as a deployment constraint.
Tenant separation as a known future requirement.

This was a good MVP architecture.

It gave control over the pipeline. It made the data model visible. It avoided framework magic. It allowed experimentation with chunks, entities, relationships, facts, and events instead of reducing everything to anonymous text blocks.

But the production constraints were very different:

ingest around 100 documents per day,
batch ingest around 10 documents per minute,
handle OCR, table extraction, and LLM structuring,
serve around 100 QPS for retrieval,
support hybrid search,
isolate tenants,
keep response latency low,
monitor accuracy and drift,
preserve traceability,
avoid cross-tenant leakage,
and reduce a full answer path that was taking roughly two minutes.

That is the gap between a RAG MVP and production RAG.

The MVP proves the retrieval pattern. Production proves the operating system around it.

What breaks first in RAG in production?

Retrieval quality breaks first.

Not generation. Not the LLM. Retrieval.

The model can only answer from the evidence it receives. If the right chunk is not retrieved, the model has three bad options:

answer from incomplete context,
hallucinate missing details,
refuse or give a vague answer.

A capable LLM cannot compensate for systematically bad retrieval.

A 2026 production-style RAG fusion study makes the same point in a different way: retrieval improvements do not automatically translate into better end-to-end answers once reranking limits, context-window limits, and latency constraints enter the system.

That observation matters because many teams optimize isolated retrieval metrics and assume the answer quality will improve. In production, extra recall can get neutralized by reranking, deduplication, truncation, conflicting chunks, or latency limits.

Failure mode 1: the right content exists but is not retrieved

One real retrieval problem looked like this:

Query: "Give me list of items needed for visual inspection of machine."

The required content existed, but retrieval missed it because the system did not match the phrase "list of items" to the relevant procedural section.

This is a common enterprise problem.

Users do not ask questions using the exact language of SOPs, manuals, or engineering documents. They ask in operational language:

"items needed"
"things to check"
"what do I inspect"
"machine visual inspection"
"pre-check list"
"before starting"
"required materials"

A pure vector search may catch some of this. A keyword search may catch some of this. Neither is enough by itself.

The fix is not "use a better embedding model" as the first response. The better response is to redesign retrieval:

preserve section headings,
capture document type and process step metadata,
use hybrid keyword + vector search,
add domain synonyms,
use query rewriting carefully,
rerank after candidate retrieval — a reranker is a second-pass model that reorders the retrieved candidates by true relevance before the top few reach the LLM,
log missed-answer cases,
and build a retrieval test set from real user queries.

Failure mode 2: chunks are not the unit of meaning

Many RAG systems treat chunks as if they are naturally meaningful.

They are not.

A chunk is often just an artifact of token limits.

In enterprise documents, the actual unit of meaning may be a policy clause, a procedure step, a row in a table, a field in a form, a machine inspection checklist, a contract obligation, a report finding, a support resolution, or a regulation-to-SOP mapping.

If chunking cuts across those boundaries, retrieval quality is damaged before embeddings are created.

Bad chunking creates several problems:

Bad chunking pattern	Production impact
Heading separated from body	Retrieved text loses meaning
Table row separated from header	Values become ambiguous
Clause split across chunks	Answer misses conditions and exceptions
Large chunk with many topics	Similarity score becomes noisy
Tiny chunk with no context	Reranker cannot judge relevance
Duplicated boilerplate chunks	Generic text outranks specific answer
Old and new versions both indexed	Stale source may win retrieval

Chunking is not preprocessing. Chunking is part of the reasoning boundary.

Failure mode 3: similar does not mean authoritative

Vector search finds semantic similarity.

It does not know whether a retrieved chunk is current, approved, superseded, tenant-specific, region-specific, role-visible, legally binding, draft, archived, or contradicted by a newer document.

This is why one of the strongest rules for production RAG is:

Similarity is not authority.

A highly similar old policy can be more dangerous than no answer. A generic FAQ can outrank the exact SOP. A template can outrank a signed contract. A draft can outrank an approved version.

The retrieval layer must therefore use metadata and source status, not just embedding distance.

Why is vector search alone not enough for enterprise RAG?

Vector search is useful, but enterprise RAG needs hybrid retrieval.

Hybrid search combines keyword and full-text retrieval with vector retrieval. Microsoft's Azure AI Search documentation describes hybrid search as a single query configured for both full-text and vector queries, running them in parallel and merging results using Reciprocal Rank Fusion. It also notes that keyword search performs better for product codes, specialized jargon, dates, and people's names.

That matches production reality.

Enterprise users search for things like invoice numbers, SKU codes, policy IDs, ticket IDs, machine model numbers, clause references, employee names, locations, dates, regulation numbers, internal acronyms, and exact error messages.

Vector search can blur these. Keyword search can preserve them.

But keyword search alone fails when users ask conceptual questions in natural language.

So production retrieval usually needs a layered pipeline:

User query
  ↓
Identity + tenant + role resolution
  ↓
Query normalization
  ↓
Metadata filter construction
  ↓
Keyword retrieval        Vector retrieval
  ↓                        ↓
Candidate merge
  ↓
Deduplication
  ↓
Reranking
  ↓
Freshness + authority scoring
  ↓
Context assembly
  ↓
Answer generation with citations
  ↓
Audit log + feedback capture

This is no longer "call vector DB and pass top-k to the model."

It is a retrieval system.

Vector search vs hybrid search

Dimension	Vector search	Hybrid search
Best for	Conceptual similarity	Mixed exact + semantic retrieval
Weakness	Can miss exact identifiers	More complex scoring and tuning
Handles well	"How do I inspect a machine before use?"	"Inspection checklist for MX-204 machine revision 3"
Production role	One signal	Default retrieval strategy
Risk if used alone	Similar but wrong chunks	More complexity, but better control

Hybrid search is not an optimization. At enterprise scale, it is a correctness requirement.

Why does stale data become a production RAG risk?

Stale data is one of the most underestimated RAG failure modes.

In a prototype, documents are uploaded once. In an enterprise, source data changes continuously: policies are updated, SOPs are revised, contracts are amended, user permissions change, folders move, product catalogs change, tickets are resolved, regulations are updated, reports are regenerated, and old documents are archived.

If the RAG index does not reflect those changes, the answer may be generated from stale evidence.

This is dangerous because the model may still sound confident.

Stale RAG is worse than normal search failure in regulated or operational workflows because it can create false assurance. A user may act on an answer that cites an outdated policy, superseded SOP, or old contract clause.

The freshness problem has multiple layers

Freshness is not one timestamp. A production RAG system needs to track several freshness dimensions:

Layer	Freshness question
Source	When did the source system change?
Ingestion	When did we ingest or sync it?
Extraction	Was text, OCR, or table extraction regenerated after source change?
Embedding	Were embeddings regenerated after content change?
Permission metadata	Were ACL or RBAC changes synced?
Versioning	Is this document current, archived, or superseded?
Cache	Is the answer or retrieval cache invalid?
Citation	Does the citation point to the current source?

Microsoft's document-level access control documentation highlights a related issue: permission changes in source systems are reflected in search results only after permission metadata is synchronized into the index.

The same principle applies to content freshness. If the source changed but the index did not, retrieval is stale. If permissions changed but the permission index did not, access control is stale.

Minimum freshness model for production RAG

At minimum, store these fields:

document_id
source_system
source_uri
source_updated_at
ingested_at
extracted_at
embedded_at
embedding_model
embedding_version
content_hash
extraction_version
document_version
is_current
superseded_by_document_id
approval_status
valid_from
valid_until
tenant_id
access_policy_version

This gives you the ability to answer two critical audit questions:

Was the retrieved source current at the time of the answer?
Was the user allowed to retrieve it at the time of the answer?

Without that, citations are weak proof. A citation says where the text came from. It does not automatically prove the source was current, approved, or authorized.

Why must access boundaries be enforced before retrieval?

Access control is the section where enterprise RAG becomes serious.

A bad RAG design does this:

Retrieve globally
  ↓
Put chunks into prompt
  ↓
Tell model not to reveal unauthorized data

That is not access control. Once unauthorized content enters the prompt, the boundary has already failed.

The correct pattern is:

Resolve user identity
  ↓
Resolve tenant, role, team, region, project, document permissions
  ↓
Apply hard filters before retrieval
  ↓
Retrieve only eligible chunks
  ↓
Rerank only eligible chunks
  ↓
Generate answer only from eligible chunks
  ↓
Log permission context

A model cannot be trusted to "unsee" unauthorized context.

OWASP's 2025 LLM Top 10 includes vector and embedding weaknesses, warning that weaknesses in how vectors and embeddings are generated, stored, or retrieved can be exploited to inject harmful content, manipulate outputs, or access sensitive information in RAG systems.

This is why access control must sit inside retrieval, not only around the final answer. This same principle is the foundation of secure AI agent architecture — never put unauthorized data into the model's context in the first place.

The access problem is harder than tenant_id

A strict tenant_id is necessary, but insufficient. Enterprise access often depends on tenant, organization, business unit, user role, project, region, document classification, source-system permission, group membership, customer account assignment, data residency, legal hold, sensitivity label, and time-bounded access.

A production RAG index must carry permission metadata at the same granularity as retrieval. If you retrieve chunks, permissions must apply to chunks. If you retrieve rows, permissions must apply to rows. If you retrieve document sections, permissions must survive sectioning.

Microsoft's Azure AI Search documentation describes document-level access control patterns including security filters, ACL/RBAC scopes, sensitivity labels, and source ACLs, with query-time enforcement trimming results to documents the caller is authorized to read.

The general design principle is portable beyond any single vendor:

Do not retrieve first and secure later. Secure the candidate set before retrieval.

Access-control failure modes

Failure mode	Example	Safer design
Cross-tenant retrieval	Tenant A retrieves Tenant B chunk	Physical or logical tenant isolation + hard filters
Role leakage	Sales user sees finance document	Role-aware metadata filter before search
Stale permissions	User removed from group but index still allows access	Permission sync + access_policy_version
Chunk loses parent ACL	Chunk table does not carry document permissions	Project ACL to every retrievable unit
Prompt-based security	"Do not reveal confidential info"	Never put unauthorized info in context
Tool bypass	SDK or API retrieves without policy layer	Central retrieval service with enforced policy
Cache leakage	Cached answer reused for another user	Cache key includes user, tenant, permission scope

In enterprise RAG, access control is not a UI feature. It is part of the retrieval algorithm.

Why do OCR, tables, and PDFs quietly damage RAG quality?

Most enterprise knowledge is not clean Markdown. It is trapped in PDFs, scans, spreadsheets, reports, invoices, SOPs, forms, tables, images, email exports, contracts, intranet pages, ERP screens, and old document templates.

This creates a hidden ingestion problem. The retrieval layer can only search what the ingestion layer preserved.

If OCR corrupts the text, embeddings encode corrupted text. If table headers are lost, table rows become meaningless. If merged cells are flattened incorrectly, the answer may cite the wrong value. If a table is split across pages, the system may retrieve only half the evidence.

Azure's RAG documentation explicitly lists large documents, images and PDFs, OCR, document extraction, chunking, terminology mismatches, hybrid queries, and semantic ranking as content-preparation and relevance concerns for RAG.

That is the boring but critical truth: ingestion quality is retrieval quality.

The table problem

Tables are especially dangerous. A table row often depends on column headers, section title, page title, merged cell labels, the preceding paragraph, a footnote, units, version, and page number.

If you embed only the row text, meaning disappears.

Bad representation:

10 | 20 | Yes | No | 4.5

Better representation:

Document: Machine Inspection SOP v3
Section: Visual Inspection Checklist
Table: Required inspection items before machine start
Row meaning:
Item: Belt tension
Required before start: Yes
Acceptable range: 4.5 mm
Applies to: MX-204
Page: 7

For RAG, a table row should become a self-contained fact while preserving raw source references.

Layer	Purpose
Raw file	Legal and source artifact
Raw OCR / extraction JSON	Auditability and reprocessing
Normalized text	Basic search
Semantic Markdown	Human-readable retrieval context
Table CSV / JSON	Structured table retrieval
Row-level facts	Precise QA over tables
Entities and relationships	Cross-document linking
Chunk embeddings	Semantic retrieval
Metadata	Filters, versioning, permissions, provenance

What production RAG architecture works at enterprise scale?

The architecture that works is usually less glamorous than teams expect.

Do not start with an exotic graph database, multi-agent retriever, and five orchestration frameworks. Start with a boring, inspectable architecture:

Source systems  →  Ingestion queue  →  Extraction pipeline
(S3, uploads,      (SQS-style queue,    (OCR, PDF parsing,
 intranet,          retries, DLQ,        table extraction,
 databases,         worker autoscaling)  metadata, hashing)
 ERP/CRM/ticketing)
        ↓
Storage
S3 for raw files and extracted artifacts
Postgres/Aurora for metadata, chunks, facts, relationships
pgvector initially, or Qdrant when scale/isolation/filtering requires it
        ↓
Indexing
full-text index   - vector index
metadata index    - permission index
        ↓
Retrieval service
identity resolution   - tenant + RBAC/ACL filters
keyword search        - vector search
reranking             - freshness scoring
        ↓
Generation service
context assembly      - source citations
answer generation     - refusal / no-answer behavior
        ↓
Observability and audit
query logs    - retrieved + filtered chunks
citations     - latency    - feedback + correction loop

Why this architecture works

It separates the system into layers with different responsibilities.

Layer	Responsibility
Source connectors	Pull or receive data without corrupting source authority
Ingestion queue	Decouple document arrival from processing
Extraction pipeline	Convert messy source files into structured artifacts
Metadata store	Preserve version, tenant, permissions, lineage, status
Vector index	Support semantic retrieval
Full-text index	Support exact retrieval
Permission engine	Prevent unauthorized candidates from entering retrieval
Reranker	Improve final evidence selection
Generation layer	Answer from selected evidence
Audit layer	Explain what happened later

This architecture also supports gradual migration. An MVP can start with SQLite + FAISS. The next step can be Postgres + pgvector. Later, if vector scale, filtering, recall, or multi-tenant isolation requires it, move the vector layer behind an abstraction to Qdrant, Weaviate, Milvus, OpenSearch, Azure AI Search, or another retrieval engine.

pgvector itself supports vector search in Postgres, with HNSW and IVFFlat index types that involve different memory, build-time, and speed/recall tradeoffs.

The key is not to marry the application to one vector store too early. Use a VectorStore or Retriever abstraction from the beginning.

Should you use pgvector, Qdrant, FAISS, or a managed search service?

The answer depends on the stage — and the vector store is one line item in the wider enterprise LLM deployment cost model.

Stage	Sensible choice	Why
Local prototype	FAISS + SQLite	Fast, cheap, simple
Early SaaS / MVP	Postgres + pgvector	One operational database, easier metadata joins
Multi-tenant production	Qdrant / OpenSearch / managed vector DB	Better filtering, isolation, operations, scaling
Compliance-heavy enterprise	Search with document-level ACL support	Access control becomes central
Very complex navigation	Agentic retrieval over existing search	Useful when one-shot retrieval is insufficient

The mistake is not choosing FAISS, pgvector, or Qdrant. The mistake is choosing any of them without deciding how you will enforce tenant isolation, metadata filters, version filters, permission filters, freshness checks, audit logging, reranking, and evaluation.

A vector database is not a governance layer.

When should you add a graph database?

Not early. A graph model is useful when the product genuinely needs real-time multi-hop traversal:

"Which SOPs are affected by this regulation change?"
"Which machines, procedures, and inspection findings connect to this failure?"
"Which contract obligations depend on this customer region and product line?"

But storing entities and relationships does not automatically require a graph database. In many enterprise RAG systems, Postgres tables for entities, relationships, facts, and events are enough for a long time.

Move to a graph database only when query patterns demand it. Do not add a graph database because RAG sounds relational. Add it because users need graph traversal that relational queries cannot support cleanly under production latency.

How should production RAG handle compliance workflows?

Compliance RAG should not behave like generic chat over documents. In compliance workflows, wrong retrieval has higher consequences. The answer must distinguish regulation, internal policy, SOP, evidence, finding, interpretation, recommendation, and open question.

A safer compliance retrieval pattern is:

User compliance question
  ↓
Check structured findings first
  ↓
Check approved policy / regulation mappings
  ↓
Retrieve narrow supporting evidence
  ↓
Generate answer with exact citations
  ↓
Flag uncertainty or missing source
  ↓
Log answer for audit

Do not always start by retrieving raw chunks. If a system has already produced compliance findings from SOPs, regulations, and enforcement documents, the first retrieval layer should check those structured findings before falling back to general RAG. This reduces the chance that a generic but similar paragraph outranks a validated finding.

Compliance RAG must support refusal

A compliance RAG system should be able to say:

"No approved source found."
"The retrieved source is outdated."
"This answer depends on a policy version that has been superseded."
"This user does not have access to the required evidence."
"The source documents conflict."
"This requires human review."

This is not a weakness. In enterprise workflows, controlled refusal is a feature.

NIST's AI Risk Management Framework and Generative AI Profile are useful reference points for thinking about AI risk management, especially around governance and risk identification for generative AI systems.

How should you evaluate production RAG quality?

Do not evaluate only the final answer. Evaluate the retrieval pipeline separately.

A production RAG system needs at least five evaluation layers:

Evaluation layer	What to measure
Retrieval recall	Did the correct source appear in the candidate set?
Ranking quality	Did the correct source survive into final top-k?
Context quality	Was the final context sufficient and non-conflicting?
Answer faithfulness	Did the answer stay grounded in the retrieved evidence?
Operational quality	Was the answer fast, authorized, current, and traceable?

Most teams skip the second and third layers. That is where production failures hide.

Build a retrieval test set from real misses

Start with real queries, not synthetic perfect questions. Include short queries, ambiguous queries, internal acronyms, wrong terminology, exact IDs, policy questions, table questions, multi-document questions, role-restricted questions, stale-document traps, and no-answer questions.

For each test query, record:

query
expected_document_id
expected_chunk_ids
acceptable_alternate_sources
required_metadata_filters
required_permission_scope
expected_answer_type
must_refuse_if_no_source
freshness_requirement

Then test keyword only, vector only, hybrid, hybrid + reranking, hybrid + query rewrite, hybrid + metadata filters, and hybrid + freshness scoring. The goal is not to prove one technique is best. The goal is to know which failures each technique catches.

What should retrieval observability capture?

LLM observability is not enough. Production RAG needs retrieval observability. For every answer, you should be able to reconstruct who asked, what they asked, which tenant and access scope was used, which sources were searched, which filters were applied, which chunks were retrieved, which chunks were filtered out, which chunks were reranked, which chunks entered the prompt, which citations appeared, whether sources were current, how long each stage took, what the model answered, whether the user gave feedback, and whether the answer was later corrected.

Minimum log schema:

retrieval_event_id
answer_event_id
tenant_id
user_id
role_scope
query_text
normalized_query
filters_applied
source_systems_searched
candidate_chunk_ids
candidate_scores
reranked_chunk_ids
final_context_chunk_ids
citations_used
freshness_status
permission_policy_version
latency_keyword_ms
latency_vector_ms
latency_rerank_ms
latency_generation_ms
total_latency_ms
answer_confidence_label
feedback_score
correction_required

This is the difference between a demo and a system. A demo answers. A system explains how the answer happened.

What worked, what did not, and what to do next?

Several things were directionally right in the journey.

Building without LangChain gave control. Frameworks are useful, but early direct implementation made the data model visible. The system stored not only chunks but entities, relationships, facts, and events, which made it easier to reason about retrieval modes beyond top-k vector similarity. This relates closely to the difference between agent memory and state — what you persist determines what you can later retrieve and trust.

SQLite + FAISS was good for MVP. It is not enterprise-ready by itself, but it forces clarity on chunk schema, embedding lifecycle, local search, metadata gaps, and retrieval behavior. The mistake would be to confuse MVP success with production readiness.

Hybrid retrieval became the right default. The retrieval miss around "list of items" exposed a real issue: user language and document language diverge. Hybrid retrieval combines semantic meaning with exact matching.

Queue-based ingestion was the right complexity level. For this scale, Kafka was unnecessary. A queue with workers, retries, visibility timeouts, and a dead-letter queue is enough for many early production RAG systems. Heavy OCR, PDF, and LLM ingestion should run in workers, not synchronous request logic.

What did not work or would not scale:

Pure vector retrieval. It misses exact identifiers, struggles with short operational queries, and cannot decide authority or freshness.
Treating FAISS + SQLite as production architecture. Production needs concurrent ingestion, query isolation, metadata filtering, permission enforcement, backup and restore, index lifecycle, observability, and operational ownership.
Embedding every keystroke. For live search UI, use prefix and autocomplete for as-you-type, run semantic retrieval only on submit or a debounce threshold, cache query embeddings, and precompute document embeddings.
Adding a graph DB too early. Start with relational facts and relationships; move to graph only when traversal becomes a core query path.
Filtering unauthorized data after retrieval. This is the most serious failure. If unauthorized chunks enter retrieval or the prompt, the boundary has already been crossed. The same lesson appears in agent design: tool output is not instruction, and untrusted content should never silently gain authority.

A realistic production RAG roadmap

Make retrieval observable. Before changing models, log the query, filters, retrieved chunk IDs, scores, source versions, user and tenant, final context, citations, latency by stage, and feedback. Without this, you are debugging blind.
Fix ingestion and chunking. Improve PDF extraction, OCR correction, table preservation, heading-to-body linking, document versioning, metadata extraction, and duplicate detection.
Move to hybrid retrieval. Add keyword search, vector search, metadata filters, domain synonyms, reranking, and deduplication. Measure retrieval recall and final top-k quality separately.
Add freshness and access control. Add tenant filters, role filters, source permission sync, document version tracking, content hashes, embedding versions, access-policy versioning, and cache invalidation. This is the enterprise boundary.
Add advanced retrieval only where justified. Query rewriting, multi-query retrieval, graph expansion, agentic retrieval, and findings-first compliance retrieval come last — never before the basic retrieval system is measurable.

---

Production RAG checklist

Use this before calling a RAG system production-ready.

Retrieval quality

Hybrid retrieval supports keyword and vector search.
Metadata filters are applied before reranking.
Reranker is evaluated separately.
A retrieval test set exists and missed-query logs are reviewed.
Exact identifiers, short queries, and ambiguous queries are tested.
Duplicate chunks are detected and handled.

Freshness

Source updated, ingested, extracted, and embedded timestamps are stored.
Embedding model and version and a content hash are stored.
Document version is stored and superseded documents are excluded by default.
Cache invalidation rules exist.

Access control

tenant_id exists on every retrievable unit.
Role, group, and project access is represented in metadata.
Permission filters run before retrieval.
Unauthorized chunks never enter prompt context.
Cache key includes tenant, user, and access scope.
Permission sync is monitored and cross-tenant tests exist.

Ingestion

Raw source files and raw extraction JSON are preserved.
Tables are stored as structured data and semantic Markdown is generated.
Failed ingestion goes to a DLQ and reprocessing is possible.
Extraction version is tracked.

Observability and governance

Query, retrieved-chunk, filter, and citation logs exist.
Latency by stage is measured and a feedback loop exists.
Corrections update evaluation cases and drift monitoring exists.
No-answer and conflicting-source behavior is defined.
Sensitive workflows support human review and audit export is possible.

---

Frequently Asked Questions About RAG in Production

What is RAG in production?

RAG in production is a retrieval-augmented generation system that is reliable enough for real users, changing data, access control, monitoring, latency constraints, and auditability. It is not just a vector database connected to an LLM.

Why does RAG fail in enterprise environments?

RAG fails in enterprises because retrieval quality, stale data, access boundaries, OCR quality, metadata, and observability are often treated as secondary problems. In reality, these are the core production problems.

Is vector search enough for production RAG?

No. Vector search is useful for semantic similarity, but production RAG usually needs hybrid search: keyword search, vector search, metadata filtering, reranking, and freshness or authority checks.

How should access control work in RAG?

Access control should be enforced before retrieval. The system should resolve the user's tenant, role, groups, and document permissions first, then retrieve only the chunks that user is allowed to see.

How do you prevent stale answers in RAG?

Track source version, content hash, ingestion time, extraction time, embedding time, document status, and supersession relationships. Exclude outdated documents by default unless the user explicitly asks for historical versions.

Should production RAG use a graph database?

Only when real-time multi-hop traversal is a core requirement. Many systems can start with relational tables for entities, relationships, facts, and events before adding graph infrastructure.

What is the biggest mistake in enterprise RAG architecture?

The biggest mistake is treating RAG as an LLM feature instead of an enterprise retrieval system. The model is only the final layer; the hard work is ingestion, metadata, access control, retrieval quality, freshness, and auditability.

How should teams start improving a slow or inaccurate RAG system?

Start by logging retrieval behavior. Capture the query, filters, retrieved chunks, scores, reranked chunks, final context, citations, source freshness, permissions, and latency. Without retrieval observability, model changes are mostly guesswork.

Key Takeaways

RAG in production is an enterprise retrieval governance problem, not just a vector-search problem.
Retrieval quality usually breaks before generation quality.
Similarity is not authority; the system must know source freshness, approval status, and applicability.
Access control must be enforced before retrieval, not after generation.
OCR, tables, PDFs, and document structure can silently damage answer quality.
Hybrid search is the practical default for enterprise RAG.
A boring architecture — object storage, relational metadata, vector index, queue-based ingestion, hard filters, audit logs — is usually the right starting point.
The system must explain not only the answer, but why those sources were eligible to answer.

---

For teams building internal AI systems, the useful review is not "which vector database should we use?" The useful question is: what sources are eligible, what permissions apply, what data is stale, what retrieval failures are happening, what evidence enters the prompt, what is logged, and what can be audited.

If you want a second opinion on a RAG or enterprise AI architecture, get in touch — or read the Designing Secure AI Agents series, which covers the same governance boundary from the agent side.

References

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — original RAG research.
Retrieve data and generate AI responses with Amazon Bedrock Knowledge Bases — AWS RAG, retrieval, and citations.
RAG and Generative AI — Azure AI Search — content preparation, chunking, OCR, hybrid queries, semantic ranking.
Hybrid Search Overview — Azure AI Search — full-text + vector search and Reciprocal Rank Fusion.
Document-Level Access Control — Azure AI Search — query-time access trimming and permission sync.
OWASP LLM08:2025 Vector and Embedding Weaknesses.
NIST AI Risk Management Framework and Generative AI Profile.
pgvector — IVFFlat and HNSW index tradeoffs.
Scaling Retrieval Augmented Generation with RAG Fusion — production-style study on recall, reranking, context limits, and latency.

Part of the series

Enterprise AI Strategy

1.AI Adoption Is an Operating-Model Change, Not a Software Installation
2.Enterprise AI Operating Model: Who Owns AI After the Pilot?
3.How to Prioritise AI Use Cases by Value, Feasibility and Risk
4.Data Readiness for Enterprise AI: What Ready Actually Means
5.From AI Pilot to Production: The Twelve Gates That Prevent Expensive Failure
6.The AI Architecture Review: What a CTO Should Demand Before Productioncoming soon
7.How Enterprises Evaluate LLM Features Before Shipping: Evals, Regression Tests, and Acceptance Criteria
8.RAG in Production: What Breaks at Enterprise Scale← you are here
9.AI Governance Without Turning the AI Team into a Committeecoming soon
10.Managed AI Operations: What Happens After the Agent Goes Livecoming soon
11.AI Incident Management: When an Agent Makes the Wrong Decisioncoming soon
12.How Executives Should Review an AI Programme Every Monthcoming soon
13.AI FinOps: A Practical Framework to Control Enterprise AI Cost Without Killing Adoption
14.Build vs Buy vs Platform: A Decision Framework for Enterprise AI Agentscoming soon
15.AI Vendor Due Diligence: Questions to Ask Before Signingcoming soon

View full series →

AITechnologySeriesJune 14, 2026