RAG in Production: What Breaks at Enterprise Scale

By Aakash Ahuja · 29 min read

RAG in production does not usually fail because the language model cannot write a good answer. It fails because the retrieval layer is allowed to behave like a demo search box inside an enterprise.

Retrieval-Augmented Generation, or RAG, is the pattern where an LLM retrieves external knowledge before generating an answer. The original RAG work combined a parametric model with non-parametric memory retrieved from a dense index, and modern cloud platforms describe RAG as a way to ground model output in external data sources rather than relying only on model training data.

That definition is accurate, but incomplete for production.

At enterprise scale, RAG is not just "LLM + vector database." It becomes a governed retrieval system that must decide which data is eligible, current, authorized, precise, traceable, and safe enough to enter the model context.

The hard production question is not:

Can we retrieve similar chunks?

The hard question is:

Can we retrieve the right evidence, from the right version, for the right user, under the right access boundary, fast enough, with enough traceability to defend the answer later?

That is where most RAG systems start breaking.


Table of Contents

  • Why RAG prototypes work before they fail
  • The real setup: from SQLite + FAISS to enterprise constraints
  • What breaks first: retrieval quality
  • Why vector search alone is not enough
  • Why stale data becomes a production risk
  • Why access boundaries must be enforced before retrieval
  • Why OCR, tables, and PDFs quietly damage RAG quality
  • What architecture works for enterprise RAG
  • How to evaluate production RAG
  • What worked, what did not, and what to do next
  • Production RAG checklist
  • FAQ
  • Key takeaways
---

Why do RAG prototypes work before they fail?

A RAG prototype usually works because the environment is forgiving.

The dataset is small. The documents are mostly clean. The users are trusted. The question set is narrow. The data does not change much. Nobody is checking whether the answer came from the latest version of the policy. Nobody is asking whether a sales user accidentally retrieved a finance-only document.

That is why the first demo feels magical.

You upload files, chunk them, create embeddings, store them in FAISS or a vector database, ask a question, retrieve top-k chunks, pass them to the LLM, and get a grounded answer.

For a demo, this is enough.

For an enterprise, this is the beginning of the problem.

A production RAG system has to survive conditions that the prototype avoided:

| Prototype assumption | Enterprise reality |
|---|---|
| All users can see all documents | Users have roles, teams, regions, tenants, and exceptions |
| Documents are static | Policies, contracts, SOPs, catalogs, tickets, and reports change |
| Text extraction is clean | PDFs, scans, tables, merged cells, images, and forms break structure |
| Top-k similarity is enough | Similarity does not prove authority, freshness, or applicability |
| One index is manageable | Multi-tenant systems need isolation, filtering, versioning, and audit trails |
| Good answer means good system | The system must prove which sources were used and why |
| Latency does not matter much | Production users expect seconds, not minutes |

The first serious realization is this:

RAG is easy when all data is equally trusted. Enterprise RAG starts when data is not equally visible, current, or authoritative.
Production RAG reference architecture showing source connectors, ingestion queue, extraction pipeline, metadata store, vector and full-text indexes, permission-aware retrieval, answer generation, and an audit and governance layer.

What was the original setup, and why was it not enough?

One practical journey started with a custom RAG system built without LangChain.

The early stack was intentionally simple:

  • SQLite for chunks, entities, relationships, facts, and events.
  • FAISS for vector search.
  • OpenAI for embeddings and generation.
  • A custom retrieval pipeline.
  • File ingestion with OCR and table extraction.
  • No Kafka.
  • AWS as the likely production environment.
  • India-only storage as a deployment constraint.
  • Tenant separation as a known future requirement.

This was a good MVP architecture.

It gave control over the pipeline. It made the data model visible. It avoided framework magic. It allowed experimentation with chunks, entities, relationships, facts, and events instead of reducing everything to anonymous text blocks.

But the production constraints were very different:

  • ingest around 100 documents per day,
  • batch ingest around 10 documents per minute,
  • handle OCR, table extraction, and LLM structuring,
  • serve around 100 QPS for retrieval,
  • support hybrid search,
  • isolate tenants,
  • keep response latency low,
  • monitor accuracy and drift,
  • preserve traceability,
  • avoid cross-tenant leakage,
  • and reduce a full answer path that was taking roughly two minutes.

That is the gap between a RAG MVP and production RAG.

The MVP proves the retrieval pattern. Production proves the operating system around it.


What breaks first in RAG in production?

Retrieval quality breaks first.

Not generation. Not the LLM. Retrieval.

The model can only answer from the evidence it receives. If the right chunk is not retrieved, the model has three bad options:

  • answer from incomplete context,
  • hallucinate missing details,
  • refuse or give a vague answer.

A capable LLM cannot compensate for systematically bad retrieval.

A 2026 production-style RAG fusion study makes the same point in a different way: retrieval improvements do not automatically translate into better end-to-end answers once reranking limits, context-window limits, and latency constraints enter the system.

That observation matters because many teams optimize isolated retrieval metrics and assume the answer quality will improve. In production, extra recall can get neutralized by reranking, deduplication, truncation, conflicting chunks, or latency limits.

Failure mode 1: the right content exists but is not retrieved

One real retrieval problem looked like this:

Query: "Give me list of items needed for visual inspection of machine."

The required content existed, but retrieval missed it because the system did not match the phrase "list of items" to the relevant procedural section.

This is a common enterprise problem.

Users do not ask questions using the exact language of SOPs, manuals, or engineering documents. They ask in operational language:

  • "items needed"
  • "things to check"
  • "what do I inspect"
  • "machine visual inspection"
  • "pre-check list"
  • "before starting"
  • "required materials"

A pure vector search may catch some of this. A keyword search may catch some of this. Neither is enough by itself.

The first response should not be "use a better embedding model." It should be to redesign retrieval:

  • preserve section headings,
  • capture document type and process step metadata,
  • use hybrid keyword + vector search,
  • add domain synonyms,
  • use query rewriting carefully,
  • rerank after candidate retrieval — a reranker is a second-pass model that reorders the retrieved candidates by true relevance before the top few reach the LLM,
  • log missed-answer cases,
  • and build a retrieval test set from real user queries.
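
One item from the list above, domain synonyms, can be sketched as a small query-expansion step that runs before keyword and vector retrieval. This is a minimal sketch, not the pipeline described in this article: the `DOMAIN_SYNONYMS` map and the `expand_query` function are hypothetical names, and a real synonym map would be curated from logged missed queries.

```python
import re

# Hypothetical synonym map; in practice it would be built from logged missed queries.
DOMAIN_SYNONYMS = {
    "items needed": ["required materials", "checklist", "things to check"],
    "visual inspection": ["visual check", "pre-start inspection"],
}


def expand_query(query: str, synonyms: dict = DOMAIN_SYNONYMS) -> list:
    """Return the normalized query plus variants with one phrase swapped for a synonym."""
    normalized = re.sub(r"\s+", " ", query.strip().lower())
    variants = [normalized]
    for phrase, alternatives in synonyms.items():
        if phrase in normalized:
            for alt in alternatives:
                variant = normalized.replace(phrase, alt)
                if variant not in variants:
                    variants.append(variant)
    return variants


variants = expand_query("Give me list of items needed for visual inspection of machine.")
for v in variants:
    print(v)
```

Each variant can be sent to the keyword index, with results merged afterwards. Expansion widens recall but can also add noise, which is one reason to rerank after candidate retrieval.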

Failure mode 2: chunks are not the unit of meaning

Many RAG systems treat chunks as if they are naturally meaningful.

They are not.

A chunk is often just an artifact of token limits.

In enterprise documents, the actual unit of meaning may be a policy clause, a procedure step, a row in a table, a field in a form, a machine inspection checklist, a contract obligation, a report finding, a support resolution, or a regulation-to-SOP mapping.

If chunking cuts across those boundaries, retrieval quality is damaged before embeddings are created.

Bad chunking creates several problems:

| Bad chunking pattern | Production impact |
|---|---|
| Heading separated from body | Retrieved text loses meaning |
| Table row separated from header | Values become ambiguous |
| Clause split across chunks | Answer misses conditions and exceptions |
| Large chunk with many topics | Similarity score becomes noisy |
| Tiny chunk with no context | Reranker cannot judge relevance |
| Duplicated boilerplate chunks | Generic text outranks specific answer |
| Old and new versions both indexed | Stale source may win retrieval |

Chunking is not preprocessing. Chunking is part of the reasoning boundary.
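
As one illustration of keeping a heading attached to its body, here is a minimal sketch of heading-aware chunking. It assumes headings are marked with a leading `#`, which is an assumption for the example only; real documents need a format-specific heading detector. The function name `chunk_by_heading` is hypothetical.

```python
def chunk_by_heading(text: str, heading_prefix: str = "#") -> list:
    """Split text into one chunk per section, keeping the heading inside the chunk text."""
    chunks = []
    heading = ""
    body = []
    # A trailing sentinel heading flushes the final section.
    for line in text.splitlines() + [heading_prefix]:
        if line.startswith(heading_prefix):
            if body:
                chunks.append({
                    "heading": heading,
                    "text": "\n".join([heading, *body]).strip(),
                })
            heading = line.lstrip(heading_prefix).strip()
            body = []
        elif line.strip():
            body.append(line.strip())
    return chunks


document = """# Visual Inspection Checklist
Check belt tension.
Check guard panels.
# Lubrication
Apply grease to bearings.
"""

chunks = chunk_by_heading(document)
for chunk in chunks:
    print(repr(chunk["text"]))
```

Because the heading travels with the body, a query about the "visual inspection checklist" can match the chunk even when the body lines never repeat those words.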

Failure mode 3: similar does not mean authoritative

Vector search finds semantic similarity.

It does not know whether a retrieved chunk is current, approved, superseded, tenant-specific, region-specific, role-visible, legally binding, draft, archived, or contradicted by a newer document.

This is why one of the strongest rules for production RAG is:

Similarity is not authority.

A highly similar old policy can be more dangerous than no answer. A generic FAQ can outrank the exact SOP. A template can outrank a signed contract. A draft can outrank an approved version.

The retrieval layer must therefore use metadata and source status, not just embedding distance.
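
A minimal sketch of that idea: source status acts as a hard gate, and similarity only orders what survives. The `Candidate` fields and the status values `"approved"`, `"draft"`, and `"archived"` are illustrative assumptions, not a fixed schema.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    chunk_id: str
    similarity: float
    approval_status: str  # assumed values: "approved", "draft", "archived"
    is_current: bool


def eligible_by_authority(candidates: list) -> list:
    """Drop non-current or unapproved sources, then order the rest by similarity."""
    kept = [c for c in candidates if c.is_current and c.approval_status == "approved"]
    return sorted(kept, key=lambda c: c.similarity, reverse=True)


candidates = [
    Candidate("policy-v1#4", 0.95, "approved", False),  # superseded but very similar
    Candidate("policy-v3-draft#4", 0.91, "draft", True),
    Candidate("policy-v2#4", 0.80, "approved", True),
]

ranked = eligible_by_authority(candidates)
print([c.chunk_id for c in ranked])
```

In this toy data the most similar chunk comes from a superseded version and never reaches the context, even though it would win on embedding distance alone.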


Why is vector search alone not enough for enterprise RAG?

Vector search is useful, but enterprise RAG needs hybrid retrieval.

Hybrid search combines keyword and full-text retrieval with vector retrieval. Microsoft's Azure AI Search documentation describes hybrid search as a single query configured for both full-text and vector queries, running them in parallel and merging results using Reciprocal Rank Fusion. It also notes that keyword search performs better for product codes, specialized jargon, dates, and people's names.

That matches production reality.

Enterprise users search for things like invoice numbers, SKU codes, policy IDs, ticket IDs, machine model numbers, clause references, employee names, locations, dates, regulation numbers, internal acronyms, and exact error messages.

Vector search can blur these. Keyword search can preserve them.

But keyword search alone fails when users ask conceptual questions in natural language.

So production retrieval usually needs a layered pipeline:

User query
  ↓
Identity + tenant + role resolution
  ↓
Query normalization
  ↓
Metadata filter construction
  ↓
Keyword retrieval        Vector retrieval
  ↓                        ↓
Candidate merge
  ↓
Deduplication
  ↓
Reranking
  ↓
Freshness + authority scoring
  ↓
Context assembly
  ↓
Answer generation with citations
  ↓
Audit log + feedback capture

This is no longer "call vector DB and pass top-k to the model."

It is a retrieval system.
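
The candidate-merge step can be illustrated with Reciprocal Rank Fusion, the merging method mentioned above. This is a minimal sketch over two already-ranked ID lists; the document IDs are made up, and `k = 60` is a commonly used constant that should be treated as a tunable assumption.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked ID lists: each list adds 1 / (k + rank) to an ID's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


keyword_hits = ["sop-7", "faq-2", "sop-3"]   # illustrative keyword ranking
vector_hits = ["sop-3", "sop-7", "tmpl-1"]   # illustrative vector ranking

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)
```

RRF uses only ranks, not raw scores, so keyword and vector scores do not need to be on the same scale before merging.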

Vector search vs hybrid search

| Dimension | Vector search | Hybrid search |
|---|---|---|
| Best for | Conceptual similarity | Mixed exact + semantic retrieval |
| Weakness | Can miss exact identifiers | More complex scoring and tuning |
| Handles well | "How do I inspect a machine before use?" | "Inspection checklist for MX-204 machine revision 3" |
| Production role | One signal | Default retrieval strategy |
| Risk if used alone | Similar but wrong chunks | More complexity, but better control |

Hybrid search is not an optimization. At enterprise scale, it is a correctness requirement.


Why does stale data become a production RAG risk?

Stale data is one of the most underestimated RAG failure modes.

In a prototype, documents are uploaded once. In an enterprise, source data changes continuously: policies are updated, SOPs are revised, contracts are amended, user permissions change, folders move, product catalogs change, tickets are resolved, regulations are updated, reports are regenerated, and old documents are archived.

If the RAG index does not reflect those changes, the answer may be generated from stale evidence.

This is dangerous because the model may still sound confident.

Stale RAG is worse than normal search failure in regulated or operational workflows because it can create false assurance. A user may act on an answer that cites an outdated policy, superseded SOP, or old contract clause.

The freshness problem has multiple layers

Freshness is not one timestamp. A production RAG system needs to track several freshness dimensions:

| Layer | Freshness question |
|---|---|
| Source | When did the source system change? |
| Ingestion | When did we ingest or sync it? |
| Extraction | Was text, OCR, or table extraction regenerated after source change? |
| Embedding | Were embeddings regenerated after content change? |
| Permission metadata | Were ACL or RBAC changes synced? |
| Versioning | Is this document current, archived, or superseded? |
| Cache | Is the answer or retrieval cache invalid? |
| Citation | Does the citation point to the current source? |

Microsoft's document-level access control documentation highlights a related issue: permission changes in source systems are reflected in search results only after permission metadata is synchronized into the index.

The same principle applies to content freshness. If the source changed but the index did not, retrieval is stale. If permissions changed but the permission index did not, access control is stale.

Minimum freshness model for production RAG

At minimum, store these fields:

document_id
source_system
source_uri
source_updated_at
ingested_at
extracted_at
embedded_at
embedding_model
embedding_version
content_hash
extraction_version
document_version
is_current
superseded_by_document_id
approval_status
valid_from
valid_until
tenant_id
access_policy_version

This gives you the ability to answer two critical audit questions:

  • Was the retrieved source current at the time of the answer?
  • Was the user allowed to retrieve it at the time of the answer?

Without that, citations are weak proof. A citation says where the text came from. It does not automatically prove the source was current, approved, or authorized.
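
A minimal sketch of how a subset of those fields could drive a staleness check at retrieval time. The `DocFreshness` class and `staleness_reasons` function are hypothetical names, and the rule that each derived timestamp must not be older than `source_updated_at` is an assumption about how the fields would be used.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class DocFreshness:
    document_id: str
    source_updated_at: datetime
    ingested_at: datetime
    extracted_at: datetime
    embedded_at: datetime
    is_current: bool
    valid_until: Optional[datetime] = None


def staleness_reasons(doc: DocFreshness, now: datetime) -> list:
    """Return human-readable reasons the indexed copy should not be trusted as fresh."""
    reasons = []
    if not doc.is_current:
        reasons.append("superseded")
    if doc.valid_until is not None and now > doc.valid_until:
        reasons.append("expired")
    # Every derived artifact must be at least as new as the source change.
    for stage in ("ingested_at", "extracted_at", "embedded_at"):
        if getattr(doc, stage) < doc.source_updated_at:
            reasons.append(f"{stage} older than source")
    return reasons


utc = timezone.utc
doc = DocFreshness(
    document_id="sop-7",
    source_updated_at=datetime(2025, 3, 1, tzinfo=utc),
    ingested_at=datetime(2025, 3, 2, tzinfo=utc),
    extracted_at=datetime(2025, 3, 2, tzinfo=utc),
    embedded_at=datetime(2025, 2, 15, tzinfo=utc),  # embeddings predate the source change
    is_current=True,
)
reasons = staleness_reasons(doc, now=datetime(2025, 3, 3, tzinfo=utc))
print(reasons)
```

An empty list means the indexed copy passed every check; anything else can be logged with the answer or used to exclude the chunk.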


Why must access boundaries be enforced before retrieval?

Access control is where enterprise RAG becomes serious.

A bad RAG design does this:

Retrieve globally
  ↓
Put chunks into prompt
  ↓
Tell model not to reveal unauthorized data

That is not access control. Once unauthorized content enters the prompt, the boundary has already failed.

The correct pattern is:

Resolve user identity
  ↓
Resolve tenant, role, team, region, project, document permissions
  ↓
Apply hard filters before retrieval
  ↓
Retrieve only eligible chunks
  ↓
Rerank only eligible chunks
  ↓
Generate answer only from eligible chunks
  ↓
Log permission context

A model cannot be trusted to "unsee" unauthorized context.

OWASP's 2025 LLM Top 10 includes vector and embedding weaknesses, warning that weaknesses in how vectors and embeddings are generated, stored, or retrieved can be exploited to inject harmful content, manipulate outputs, or access sensitive information in RAG systems.

This is why access control must sit inside retrieval, not only around the final answer. This same principle is the foundation of secure AI agent architecture — never put unauthorized data into the model's context in the first place.
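
The "filter before retrieval" pattern can be sketched in a few lines. This is a toy in-memory version with hypothetical `Chunk` and `Caller` types and a plain dot product as the similarity score. In a real deployment the same tenant and role predicate would be pushed into the search engine's filter so that ineligible chunks are never scored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    tenant_id: str
    allowed_roles: frozenset
    embedding: tuple


@dataclass(frozen=True)
class Caller:
    tenant_id: str
    roles: frozenset


def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve_eligible(query_vec, chunks, caller, top_k=3):
    """Apply tenant and role filters first; similarity ranks only what the caller may read."""
    eligible = [
        c for c in chunks
        if c.tenant_id == caller.tenant_id and c.allowed_roles & caller.roles
    ]
    return sorted(eligible, key=lambda c: dot(query_vec, c.embedding), reverse=True)[:top_k]


chunks = [
    Chunk("sales-playbook", "tenant-a", frozenset({"sales"}), (0.9, 0.1)),
    Chunk("finance-forecast", "tenant-a", frozenset({"finance"}), (1.0, 0.0)),
    Chunk("other-tenant-doc", "tenant-b", frozenset({"sales"}), (1.0, 0.0)),
]
caller = Caller("tenant-a", frozenset({"sales"}))
results = retrieve_eligible((1.0, 0.0), chunks, caller)
print([c.chunk_id for c in results])
```

The finance and cross-tenant chunks score higher on similarity in this toy data, yet neither can reach the prompt because they were removed before ranking.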

The access problem is harder than tenant_id

A strict tenant_id is necessary, but insufficient. Enterprise access often depends on tenant, organization, business unit, user role, project, region, document classification, source-system permission, group membership, customer account assignment, data residency, legal hold, sensitivity label, and time-bounded access.

A production RAG index must carry permission metadata at the same granularity as retrieval. If you retrieve chunks, permissions must apply to chunks. If you retrieve rows, permissions must apply to rows. If you retrieve document sections, permissions must survive sectioning.

Microsoft's Azure AI Search documentation describes document-level access control patterns including security filters, ACL/RBAC scopes, sensitivity labels, and source ACLs, with query-time enforcement trimming results to documents the caller is authorized to read.

The general design principle is portable beyond any single vendor:

Do not retrieve first and secure later. Secure the candidate set before retrieval.

Access-control failure modes

| Failure mode | Example | Safer design |
|---|---|---|
| Cross-tenant retrieval | Tenant A retrieves Tenant B chunk | Physical or logical tenant isolation + hard filters |
| Role leakage | Sales user sees finance document | Role-aware metadata filter before search |
| Stale permissions | User removed from group but index still allows access | Permission sync + access_policy_version |
| Chunk loses parent ACL | Chunk table does not carry document permissions | Project ACL to every retrievable unit |
| Prompt-based security | "Do not reveal confidential info" | Never put unauthorized info in context |
| Tool bypass | SDK or API retrieves without policy layer | Central retrieval service with enforced policy |
| Cache leakage | Cached answer reused for another user | Cache key includes user, tenant, permission scope |

In enterprise RAG, access control is not a UI feature. It is part of the retrieval algorithm.
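
The cache-leakage row can be made concrete with a cache key that includes the caller's identity and permission scope. This is a minimal sketch; the function name `retrieval_cache_key` and its parameters are hypothetical, and including a policy version in the key is an assumption about how permission changes would invalidate cached entries.

```python
import hashlib
import json


def retrieval_cache_key(tenant_id, user_id, permission_scope, policy_version, normalized_query):
    """Build a cache key that changes whenever the caller or their access scope changes."""
    payload = json.dumps(
        {
            "tenant_id": tenant_id,
            "user_id": user_id,
            "permission_scope": sorted(permission_scope),  # order-independent
            "policy_version": policy_version,
            "query": normalized_query,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_sales = retrieval_cache_key("tenant-a", "u-1", ["sales", "emea"], "v12", "belt tension range")
key_same = retrieval_cache_key("tenant-a", "u-1", ["emea", "sales"], "v12", "belt tension range")
key_other_user = retrieval_cache_key("tenant-a", "u-2", ["sales", "emea"], "v12", "belt tension range")
key_new_policy = retrieval_cache_key("tenant-a", "u-1", ["sales", "emea"], "v13", "belt tension range")
print(key_sales == key_same, key_sales == key_other_user, key_sales == key_new_policy)
```

Keying on the query text alone would let one user's cached answer be served to another user with a narrower scope.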


Why do OCR, tables, and PDFs quietly damage RAG quality?

Most enterprise knowledge is not clean Markdown. It is trapped in PDFs, scans, spreadsheets, reports, invoices, SOPs, forms, tables, images, email exports, contracts, intranet pages, ERP screens, and old document templates.

This creates a hidden ingestion problem. The retrieval layer can only search what the ingestion layer preserved.

If OCR corrupts the text, embeddings encode corrupted text. If table headers are lost, table rows become meaningless. If merged cells are flattened incorrectly, the answer may cite the wrong value. If a table is split across pages, the system may retrieve only half the evidence.

Azure's RAG documentation explicitly lists large documents, images and PDFs, OCR, document extraction, chunking, terminology mismatches, hybrid queries, and semantic ranking as content-preparation and relevance concerns for RAG.

That is the boring but critical truth: ingestion quality is retrieval quality.

The table problem

Tables are especially dangerous. A table row often depends on column headers, section title, page title, merged cell labels, the preceding paragraph, a footnote, units, version, and page number.

If you embed only the row text, meaning disappears.

Bad representation:

10 | 20 | Yes | No | 4.5

Better representation:

Document: Machine Inspection SOP v3
Section: Visual Inspection Checklist
Table: Required inspection items before machine start
Row meaning:
  • Item: Belt tension
  • Required before start: Yes
  • Acceptable range: 4.5 mm
  • Applies to: MX-204
  • Page: 7

For RAG, a table row should become a self-contained fact while preserving raw source references.
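
A minimal sketch of that conversion, using the example values above. The function name `row_to_fact` and the shape of the `context` dictionary are assumptions for illustration. The raw row and its source reference would still be stored separately.

```python
def row_to_fact(headers: list, row: list, context: dict) -> str:
    """Render one table row as self-contained text: document context first, then header/value pairs."""
    if len(headers) != len(row):
        raise ValueError("header and row length differ; the table was likely mis-extracted")
    lines = [f"{key}: {value}" for key, value in context.items()]
    lines += [f"{header}: {cell}" for header, cell in zip(headers, row)]
    return "\n".join(lines)


fact = row_to_fact(
    headers=["Item", "Required before start", "Acceptable range", "Applies to"],
    row=["Belt tension", "Yes", "4.5 mm", "MX-204"],
    context={
        "Document": "Machine Inspection SOP v3",
        "Section": "Visual Inspection Checklist",
        "Table": "Required inspection items before machine start",
        "Page": 7,
    },
)
print(fact)
```

Raising on a header/row length mismatch is a deliberate choice in this sketch: a silently misaligned row is how the wrong value ends up cited.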

Recommended multi-layer document representation

For serious document ingestion, store multiple derived forms:

| Layer | Purpose |
|---|---|
| Raw file | Legal and source artifact |
| Raw OCR / extraction JSON | Auditability and reprocessing |
| Normalized text | Basic search |
| Semantic Markdown | Human-readable retrieval context |
| Table CSV / JSON | Structured table retrieval |
| Row-level facts | Precise QA over tables |
| Entities and relationships | Cross-document linking |
| Chunk embeddings | Semantic retrieval |
| Metadata | Filters, versioning, permissions, provenance |

The mistake is to throw away raw extraction data after creating chunks. Keep raw extraction output. Treat text, Markdown, CSV, facts, and embeddings as derived artifacts that can be regenerated.


What production RAG architecture works at enterprise scale?

The architecture that works is usually less glamorous than teams expect.

Do not start with an exotic graph database, multi-agent retriever, and five orchestration frameworks. Start with a boring, inspectable architecture:

Source systems
(S3, uploads, intranet, databases, ERP/CRM/ticketing)
  ↓
Ingestion queue
(SQS-style queue, retries, DLQ, worker autoscaling)
  ↓
Extraction pipeline
(OCR, PDF parsing, table extraction, metadata, hashing)
  ↓
Storage
  • S3 for raw files and extracted artifacts
  • Postgres/Aurora for metadata, chunks, facts, relationships
  • pgvector initially, or Qdrant when scale/isolation/filtering requires it
  ↓
Indexing
  • full-text index
  • vector index
  • metadata index
  • permission index
  ↓
Retrieval service
  • identity resolution
  • tenant + RBAC/ACL filters
  • keyword search
  • vector search
  • reranking
  • freshness scoring
  ↓
Generation service
  • context assembly
  • source citations
  • answer generation
  • refusal / no-answer behavior
  ↓
Observability and audit
  • query logs
  • retrieved + filtered chunks
  • citations
  • latency
  • feedback + correction loop

Why this architecture works

It separates the system into layers with different responsibilities.

| Layer | Responsibility |
|---|---|
| Source connectors | Pull or receive data without corrupting source authority |
| Ingestion queue | Decouple document arrival from processing |
| Extraction pipeline | Convert messy source files into structured artifacts |
| Metadata store | Preserve version, tenant, permissions, lineage, status |
| Vector index | Support semantic retrieval |
| Full-text index | Support exact retrieval |
| Permission engine | Prevent unauthorized candidates from entering retrieval |
| Reranker | Improve final evidence selection |
| Generation layer | Answer from selected evidence |
| Audit layer | Explain what happened later |

This architecture also supports gradual migration. An MVP can start with SQLite + FAISS. The next step can be Postgres + pgvector. Later, if vector scale, filtering, recall, or multi-tenant isolation requires it, move the vector layer behind an abstraction to Qdrant, Weaviate, Milvus, OpenSearch, Azure AI Search, or another retrieval engine.

pgvector itself supports vector search in Postgres, with HNSW and IVFFlat index types that involve different memory, build-time, and speed/recall tradeoffs.

The key is not to marry the application to one vector store too early. Use a VectorStore or Retriever abstraction from the beginning.
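
A minimal sketch of such an abstraction, using a `typing.Protocol` and a toy in-memory backend. The method names `upsert` and `search` and the equality-only `filters` dictionary are assumptions for illustration; they are not the API of FAISS, pgvector, or Qdrant, each of which would need its own adapter behind the same interface.

```python
import math
from typing import Protocol, Sequence


class VectorStore(Protocol):
    """Interface the application depends on; backends are swapped behind it."""

    def upsert(self, chunk_id: str, vector: Sequence[float], metadata: dict) -> None: ...

    def search(self, vector: Sequence[float], top_k: int, filters: dict) -> list: ...


class InMemoryVectorStore:
    """Toy backend for tests and local prototypes; brute-force cosine similarity."""

    def __init__(self) -> None:
        self._rows = {}

    def upsert(self, chunk_id, vector, metadata):
        self._rows[chunk_id] = (list(vector), dict(metadata))

    def search(self, vector, top_k, filters):
        def cosine(a, b):
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

        # Metadata filters are applied before any similarity scoring.
        eligible = [
            (chunk_id, vec)
            for chunk_id, (vec, meta) in self._rows.items()
            if all(meta.get(key) == value for key, value in filters.items())
        ]
        eligible.sort(key=lambda item: cosine(vector, item[1]), reverse=True)
        return [chunk_id for chunk_id, _ in eligible[:top_k]]


store: VectorStore = InMemoryVectorStore()
store.upsert("a1", [1.0, 0.0], {"tenant_id": "tenant-a"})
store.upsert("a2", [0.0, 1.0], {"tenant_id": "tenant-a"})
store.upsert("b1", [1.0, 0.0], {"tenant_id": "tenant-b"})
hits = store.search([1.0, 0.1], top_k=2, filters={"tenant_id": "tenant-a"})
print(hits)
```

Making `filters` a required argument of `search` is a deliberate design choice here: callers cannot run an unfiltered query by simply forgetting a parameter.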

Should you use pgvector, Qdrant, FAISS, or a managed search service?

The answer depends on the stage — and the vector store is one line item in the wider enterprise LLM deployment cost model.

| Stage | Sensible choice | Why |
|---|---|---|
| Local prototype | FAISS + SQLite | Fast, cheap, simple |
| Early SaaS / MVP | Postgres + pgvector | One operational database, easier metadata joins |
| Multi-tenant production | Qdrant / OpenSearch / managed vector DB | Better filtering, isolation, operations, scaling |
| Compliance-heavy enterprise | Search with document-level ACL support | Access control becomes central |
| Very complex navigation | Agentic retrieval over existing search | Useful when one-shot retrieval is insufficient |

The mistake is not choosing FAISS, pgvector, or Qdrant. The mistake is choosing any of them without deciding how you will enforce tenant isolation, metadata filters, version filters, permission filters, freshness checks, audit logging, reranking, and evaluation.

A vector database is not a governance layer.

When should you add a graph database?

Not early. A graph model is useful when the product genuinely needs real-time multi-hop traversal:

  • "Which SOPs are affected by this regulation change?"
  • "Which machines, procedures, and inspection findings connect to this failure?"
  • "Which contract obligations depend on this customer region and product line?"

But storing entities and relationships does not automatically require a graph database. In many enterprise RAG systems, Postgres tables for entities, relationships, facts, and events are enough for a long time.

Move to a graph database only when query patterns demand it. Do not add a graph database because RAG sounds relational. Add it because users need graph traversal that relational queries cannot support cleanly under production latency.


How should production RAG handle compliance workflows?

Compliance RAG should not behave like generic chat over documents. In compliance workflows, wrong retrieval has higher consequences. The answer must distinguish regulation, internal policy, SOP, evidence, finding, interpretation, recommendation, and open question.

A safer compliance retrieval pattern is:

User compliance question
  ↓
Check structured findings first
  ↓
Check approved policy / regulation mappings
  ↓
Retrieve narrow supporting evidence
  ↓
Generate answer with exact citations
  ↓
Flag uncertainty or missing source
  ↓
Log answer for audit

Do not always start by retrieving raw chunks. If a system has already produced compliance findings from SOPs, regulations, and enforcement documents, the first retrieval layer should check those structured findings before falling back to general RAG. This reduces the chance that a generic but similar paragraph outranks a validated finding.

Compliance RAG must support refusal

A compliance RAG system should be able to say:

  • "No approved source found."
  • "The retrieved source is outdated."
  • "This answer depends on a policy version that has been superseded."
  • "This user does not have access to the required evidence."
  • "The source documents conflict."
  • "This requires human review."

This is not a weakness. In enterprise workflows, controlled refusal is a feature.
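
Controlled refusal can be sketched as a gate that runs before generation. This is a minimal sketch covering three of the messages above. The `Evidence` fields are hypothetical, and the `stance` label, which a real system would have to derive from structured findings or a separate classifier, is an assumption used only to detect conflicts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    source_id: str
    approved: bool
    is_current: bool
    stance: str  # hypothetical label such as "permits" or "prohibits"


def refusal_reason(evidence: list) -> Optional[str]:
    """Return a refusal message if the evidence cannot support an answer, else None."""
    approved = [e for e in evidence if e.approved]
    if not approved:
        return "No approved source found."
    current = [e for e in approved if e.is_current]
    if not current:
        return "The retrieved source is outdated."
    if len({e.stance for e in current}) > 1:
        return "The source documents conflict."
    return None


outdated = refusal_reason([Evidence("sop-3-v1", True, False, "permits")])
conflict = refusal_reason([
    Evidence("policy-9", True, True, "permits"),
    Evidence("sop-3-v2", True, True, "prohibits"),
])
answerable = refusal_reason([Evidence("sop-3-v2", True, True, "prohibits")])
print(outdated, conflict, answerable, sep="\n")
```

Only when the function returns `None` does the request proceed to answer generation; any other result is returned to the user and logged.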

NIST's AI Risk Management Framework and Generative AI Profile are useful reference points for thinking about AI risk management, especially around governance and risk identification for generative AI systems.


How should you evaluate production RAG quality?

Do not evaluate only the final answer. Evaluate the retrieval pipeline separately.

A production RAG system needs at least five evaluation layers:

| Evaluation layer | What to measure |
|---|---|
| Retrieval recall | Did the correct source appear in the candidate set? |
| Ranking quality | Did the correct source survive into final top-k? |
| Context quality | Was the final context sufficient and non-conflicting? |
| Answer faithfulness | Did the answer stay grounded in the retrieved evidence? |
| Operational quality | Was the answer fast, authorized, current, and traceable? |

Most teams skip the second and third layers. That is where production failures hide.

Build a retrieval test set from real misses

Start with real queries, not synthetic perfect questions. Include short queries, ambiguous queries, internal acronyms, wrong terminology, exact IDs, policy questions, table questions, multi-document questions, role-restricted questions, stale-document traps, and no-answer questions.

For each test query, record:

query
expected_document_id
expected_chunk_ids
acceptable_alternate_sources
required_metadata_filters
required_permission_scope
expected_answer_type
must_refuse_if_no_source
freshness_requirement

Then test keyword only, vector only, hybrid, hybrid + reranking, hybrid + query rewrite, hybrid + metadata filters, and hybrid + freshness scoring. The goal is not to prove one technique is best. The goal is to know which failures each technique catches.
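
Comparing configurations over the same test set can be sketched with a simple recall-at-k function. The two "retrievers" below are canned lookup tables standing in for real pipelines, and the chunk IDs and results are invented for illustration; they are not measurements from any system.

```python
def recall_at_k(test_cases: list, retrieve, k: int = 5) -> float:
    """Fraction of test queries for which at least one expected chunk appears in the top k."""
    if not test_cases:
        return 0.0
    hits = sum(
        1
        for case in test_cases
        if set(case["expected_chunk_ids"]) & set(retrieve(case["query"])[:k])
    )
    return hits / len(test_cases)


test_cases = [
    {"query": "items needed for visual inspection", "expected_chunk_ids": ["sop-7#3"]},
    {"query": "MX-204 belt tension range", "expected_chunk_ids": ["sop-7#5"]},
]

# Canned results standing in for two retrieval configurations (illustrative only).
vector_only = {
    "items needed for visual inspection": ["faq-2#1", "sop-7#3"],
    "MX-204 belt tension range": ["faq-2#4", "tmpl-1#2"],
}
hybrid = {
    "items needed for visual inspection": ["sop-7#3", "faq-2#1"],
    "MX-204 belt tension range": ["sop-7#5", "faq-2#4"],
}

vector_recall = recall_at_k(test_cases, lambda q: vector_only.get(q, []), k=2)
hybrid_recall = recall_at_k(test_cases, lambda q: hybrid.get(q, []), k=2)
print(vector_recall, hybrid_recall)
```

Running the same function once on the candidate set and once on the final context separates retrieval recall from ranking quality, the two layers that are easiest to conflate.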

What should retrieval observability capture?

LLM observability is not enough. Production RAG needs retrieval observability. For every answer, you should be able to reconstruct who asked, what they asked, which tenant and access scope was used, which sources were searched, which filters were applied, which chunks were retrieved, which chunks were filtered out, which chunks were reranked, which chunks entered the prompt, which citations appeared, whether sources were current, how long each stage took, what the model answered, whether the user gave feedback, and whether the answer was later corrected.

Minimum log schema:

retrieval_event_id
answer_event_id
tenant_id
user_id
role_scope
query_text
normalized_query
filters_applied
source_systems_searched
candidate_chunk_ids
candidate_scores
reranked_chunk_ids
final_context_chunk_ids
citations_used
freshness_status
permission_policy_version
latency_keyword_ms
latency_vector_ms
latency_rerank_ms
latency_generation_ms
total_latency_ms
answer_confidence_label
feedback_score
correction_required

This is the difference between a demo and a system. A demo answers. A system explains how the answer happened.
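
A minimal sketch of capturing per-stage latency into one structured event, using a subset of the fields above. The `timed` helper is a hypothetical name, and the stage bodies are placeholders where real keyword search, vector search, and reranking calls would go.

```python
import json
import time
import uuid
from contextlib import contextmanager


@contextmanager
def timed(event: dict, field: str):
    """Record the elapsed wall-clock time of a stage, in milliseconds, under `field`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        event[field] = round((time.perf_counter() - start) * 1000, 3)


event = {
    "retrieval_event_id": str(uuid.uuid4()),
    "tenant_id": "tenant-a",
    "user_id": "u-1",
    "query_text": "MX-204 belt tension range",
    "filters_applied": {"tenant_id": "tenant-a", "is_current": True},
}

with timed(event, "latency_keyword_ms"):
    keyword_ids = ["sop-7#5", "faq-2#4"]  # placeholder for a real keyword search call
with timed(event, "latency_vector_ms"):
    vector_ids = ["sop-7#5", "tmpl-1#2"]  # placeholder for a real vector search call
with timed(event, "latency_rerank_ms"):
    final_ids = ["sop-7#5"]  # placeholder for a real reranker call

event["candidate_chunk_ids"] = sorted(set(keyword_ids) | set(vector_ids))
event["final_context_chunk_ids"] = final_ids
log_line = json.dumps(event, sort_keys=True)
print(log_line)
```

Writing one JSON line per retrieval event keeps the log easy to replay later, whether for building test cases or reconstructing an answer during an audit.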


What worked, what did not, and what to do next?

Several things were directionally right in the journey.

Building without LangChain gave control. Frameworks are useful, but early direct implementation made the data model visible. The system stored not only chunks but entities, relationships, facts, and events, which made it easier to reason about retrieval modes beyond top-k vector similarity. This relates closely to the difference between agent memory and state — what you persist determines what you can later retrieve and trust.

SQLite + FAISS was good for an MVP. It is not enterprise-ready by itself, but it forces clarity on chunk schema, embedding lifecycle, local search, metadata gaps, and retrieval behavior. The mistake would be to confuse MVP success with production readiness.

Hybrid retrieval became the right default. The retrieval miss around "list of items" exposed a real issue: user language and document language diverge. Hybrid retrieval combines semantic meaning with exact matching.

Queue-based ingestion was the right complexity level. For this scale, Kafka was unnecessary. A queue with workers, retries, visibility timeouts, and a dead-letter queue is enough for many early production RAG systems. Heavy OCR, PDF, and LLM ingestion should run in workers, not synchronous request logic.
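
The retry and dead-letter behavior can be sketched with an in-process queue. This is a simulation for illustration only: a managed queue such as SQS would provide visibility timeouts and redelivery itself, and `run_ingestion`, the `process` callback, and the file names are hypothetical.

```python
from collections import deque


def run_ingestion(jobs: list, process, max_attempts: int = 3):
    """Process jobs from a queue; retry failures and dead-letter a job after max_attempts."""
    queue = deque((job, 1) for job in jobs)
    done, dead_letter = [], []
    while queue:
        job, attempt = queue.popleft()
        try:
            process(job)
            done.append(job)
        except Exception as exc:
            if attempt >= max_attempts:
                dead_letter.append((job, str(exc)))  # keep the reason for later triage
            else:
                queue.append((job, attempt + 1))  # requeue at the back
    return done, dead_letter


attempts = {}


def process(job: str) -> None:
    """Stand-in worker: one file always fails, one succeeds on its second attempt."""
    attempts[job] = attempts.get(job, 0) + 1
    if job == "corrupt.pdf":
        raise ValueError("unreadable PDF")
    if job == "scan.pdf" and attempts[job] < 2:
        raise TimeoutError("OCR timeout")


done, dead_letter = run_ingestion(["sop.pdf", "scan.pdf", "corrupt.pdf"], process)
print(done, dead_letter)
```

The transient OCR failure recovers on retry, while the permanently broken file ends up in the dead-letter list with its error instead of blocking the queue.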

What did not work or would not scale:

  • Pure vector retrieval. It misses exact identifiers, struggles with short operational queries, and cannot decide authority or freshness.
  • Treating FAISS + SQLite as production architecture. Production needs concurrent ingestion, query isolation, metadata filtering, permission enforcement, backup and restore, index lifecycle, observability, and operational ownership.
  • Embedding every keystroke. For live search UI, use prefix and autocomplete for as-you-type, run semantic retrieval only on submit or a debounce threshold, cache query embeddings, and precompute document embeddings.
  • Adding a graph DB too early. Start with relational facts and relationships; move to graph only when traversal becomes a core query path.
  • Filtering unauthorized data after retrieval. This is the most serious failure. If unauthorized chunks enter retrieval or the prompt, the boundary has already been crossed. The same lesson appears in agent design: tool output is not instruction, and untrusted content should never silently gain authority.
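
The "embedding every keystroke" item above can be sketched as a gate plus a cache. In this minimal sketch `fake_embed` is a placeholder for a real embedding API call, and `on_input`, `min_chars`, and the call counter are hypothetical names used only to show that as-you-type input never reaches the embedding step.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the (placeholder) embedding call runs


def fake_embed(text: str) -> tuple:
    """Placeholder for a real embedding API call."""
    calls["count"] += 1
    return (float(len(text)), float(sum(map(ord, text)) % 97))


@lru_cache(maxsize=1024)
def cached_query_embedding(normalized_query: str) -> tuple:
    return fake_embed(normalized_query)


def on_input(text: str, submitted: bool, min_chars: int = 3):
    """Embed only on submit; keystrokes should be served by prefix or autocomplete search."""
    query = " ".join(text.lower().split())
    if not submitted or len(query) < min_chars:
        return None
    return cached_query_embedding(query)


on_input("bel", submitted=False)            # as-you-type: no embedding call
on_input("belt tension", submitted=False)   # still typing: no embedding call
first = on_input("Belt  tension", submitted=True)
second = on_input("belt tension", submitted=True)  # same normalized query: cache hit
print(calls["count"], first == second)
```

Normalizing the query before the cache lookup matters: without it, differences in case or spacing would each trigger a separate embedding call.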

A realistic production RAG roadmap

  • Make retrieval observable. Before changing models, log the query, filters, retrieved chunk IDs, scores, source versions, user and tenant, final context, citations, latency by stage, and feedback. Without this, you are debugging blind.
  • Fix ingestion and chunking. Improve PDF extraction, OCR correction, table preservation, heading-to-body linking, document versioning, metadata extraction, and duplicate detection.
  • Move to hybrid retrieval. Add keyword search, vector search, metadata filters, domain synonyms, reranking, and deduplication. Measure retrieval recall and final top-k quality separately.
  • Add freshness and access control. Add tenant filters, role filters, source permission sync, document version tracking, content hashes, embedding versions, access-policy versioning, and cache invalidation. This is the enterprise boundary.
  • Add advanced retrieval only where justified. Query rewriting, multi-query retrieval, graph expansion, agentic retrieval, and findings-first compliance retrieval come last — never before the basic retrieval system is measurable.
---

Production RAG checklist

Use this before calling a RAG system production-ready.

Retrieval quality

  • Hybrid retrieval supports keyword and vector search.
  • Metadata filters are applied before reranking.
  • Reranker is evaluated separately.
  • A retrieval test set exists and missed-query logs are reviewed.
  • Exact identifiers, short queries, and ambiguous queries are tested.
  • Duplicate chunks are detected and handled.

Freshness

  • Timestamps for source update, ingestion, extraction, and embedding are stored.
  • The embedding model name and version, plus a content hash, are stored.
  • Document version is stored and superseded documents are excluded by default.
  • Cache invalidation rules exist.

Access control

  • tenant_id exists on every retrievable unit.
  • Role, group, and project access is represented in metadata.
  • Permission filters run before retrieval.
  • Unauthorized chunks never enter prompt context.
  • Cache key includes tenant, user, and access scope.
  • Permission sync is monitored and cross-tenant tests exist.

Ingestion

  • Raw source files and raw extraction JSON are preserved.
  • Tables are stored as structured data and semantic Markdown is generated.
  • Failed ingestion goes to a DLQ and reprocessing is possible.
  • Extraction version is tracked.

Observability and governance

  • Query, retrieved-chunk, filter, and citation logs exist.
  • Latency by stage is measured and a feedback loop exists.
  • Corrections update evaluation cases and drift monitoring exists.
  • No-answer and conflicting-source behavior is defined.
  • Sensitive workflows support human review and audit export is possible.
---

Frequently Asked Questions About RAG in Production

What is RAG in production?

RAG in production is a retrieval-augmented generation system that is reliable enough for real users, changing data, access control, monitoring, latency constraints, and auditability. It is not just a vector database connected to an LLM.

Why does RAG fail in enterprise environments?

RAG fails in enterprises because retrieval quality, stale data, access boundaries, OCR quality, metadata, and observability are often treated as secondary problems. In reality, these are the core production problems.

Is vector search enough for production RAG?

No. Vector search is useful for semantic similarity, but production RAG usually needs hybrid search: keyword search, vector search, metadata filtering, reranking, and freshness or authority checks.
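
One common way to combine a keyword ranking and a vector ranking is reciprocal rank fusion (RRF). The article does not prescribe a fusion method, so treat this as a sketch of one option. The chunk IDs are made up, and `k=60` is the conventional default from the RRF literature rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence


def reciprocal_rank_fusion(rankings: Iterable[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) for every chunk it contains,
    so chunks ranked well by several retrievers rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


keyword_hits = ["c1", "c2", "c3"]  # e.g. from a keyword index
vector_hits = ["c3", "c1", "c4"]   # e.g. from a vector index
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Here `c1` and `c3` appear in both lists and therefore outrank `c2` and `c4`. Metadata filters and a reranker would sit before and after this step respectively.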

How should access control work in RAG?

Access control should be enforced before retrieval. The system should resolve the user's tenant, role, groups, and document permissions first, then retrieve only the chunks that user is allowed to see.
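
A minimal sketch of that ordering, with eligibility resolved first and ranking second, is shown below. `AccessScope`, `Chunk`, and the term-overlap scoring are illustrative assumptions. A real system would usually push the same filter into the index query instead of filtering an in-memory list.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class AccessScope:
    """What the caller may see, resolved before any retrieval happens."""
    tenant_id: str
    groups: FrozenSet[str]


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    tenant_id: str
    allowed_groups: FrozenSet[str]
    text: str


def is_eligible(chunk: Chunk, scope: AccessScope) -> bool:
    """A chunk is eligible only for the same tenant and a shared group."""
    return chunk.tenant_id == scope.tenant_id and bool(chunk.allowed_groups & scope.groups)


def retrieve(query: str, chunks: List[Chunk], scope: AccessScope, top_k: int = 3) -> List[Chunk]:
    # Filter first: unauthorized chunks are never scored or ranked.
    eligible = [c for c in chunks if is_eligible(c, scope)]
    terms = set(query.lower().split())
    # Toy relevance score (shared terms) standing in for real search.
    scored = [(len(terms & set(c.text.lower().split())), c) for c in eligible]
    scored = [(s, c) for s, c in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1].chunk_id))
    return [c for _, c in scored[:top_k]]
```

Because the filter runs before scoring, a finance-only chunk cannot reach a sales user's prompt context even when it is the closest match.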

How do you prevent stale answers in RAG?

Track source version, content hash, ingestion time, extraction time, embedding time, document status, and supersession relationships. Exclude outdated documents by default unless the user explicitly asks for historical versions.
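
As a rough sketch, the default exclusion of superseded documents and a content-hash check for stale embeddings could look like this. The `DocVersion` fields and the status values `"active"` and `"superseded"` are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DocVersion:
    doc_id: str
    version: int
    status: str                   # assumed values: "active" or "superseded"
    content_hash: str             # hash of the current extracted text
    embedded_hash: Optional[str]  # hash of the text that was last embedded


def content_hash(text: str) -> str:
    """Stable fingerprint of the extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def eligible_versions(versions: List[DocVersion], include_historical: bool = False) -> List[DocVersion]:
    """Exclude superseded versions unless history is explicitly requested."""
    if include_historical:
        return list(versions)
    return [v for v in versions if v.status == "active"]


def needs_reembedding(version: DocVersion) -> bool:
    """True when the stored embedding was built from different text."""
    return version.embedded_hash != version.content_hash
```

Comparing hashes rather than timestamps avoids re-embedding documents that were re-saved without changes.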

Should production RAG use a graph database?

Only when real-time multi-hop traversal is a core requirement. Many systems can start with relational tables for entities, relationships, facts, and events before adding graph infrastructure.
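
To illustrate how far plain relational tables can go, the sketch below stores entities and relationships in SQLite and answers a bounded multi-hop question with a recursive CTE. The table names, relationship labels, and sample rows are hypothetical.

```python
import sqlite3
from typing import List, Tuple


def build_db() -> sqlite3.Connection:
    """Create an in-memory database with a small illustrative graph."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entities (id TEXT PRIMARY KEY, kind TEXT NOT NULL);
        CREATE TABLE relationships (
            src TEXT NOT NULL REFERENCES entities(id),
            rel TEXT NOT NULL,
            dst TEXT NOT NULL REFERENCES entities(id)
        );
    """)
    conn.executemany("INSERT INTO entities VALUES (?, ?)", [
        ("policy-7", "policy"), ("policy-3", "policy"),
        ("team-finance", "team"), ("org-acme", "org"),
    ])
    conn.executemany("INSERT INTO relationships VALUES (?, ?, ?)", [
        ("policy-7", "supersedes", "policy-3"),
        ("policy-7", "owned_by", "team-finance"),
        ("team-finance", "part_of", "org-acme"),
    ])
    return conn


def reachable(conn: sqlite3.Connection, start: str, max_hops: int = 2) -> List[Tuple[str, int]]:
    """Return (entity_id, min_hops) for everything within max_hops of start."""
    return conn.execute("""
        WITH RECURSIVE walk(id, hops) AS (
            SELECT ?, 0
            UNION
            SELECT r.dst, w.hops + 1
            FROM walk AS w JOIN relationships AS r ON r.src = w.id
            WHERE w.hops < ?
        )
        SELECT id, MIN(hops) AS min_hops
        FROM walk WHERE hops > 0
        GROUP BY id ORDER BY min_hops, id
    """, (start, max_hops)).fetchall()
```

With the sample rows, `reachable(build_db(), "policy-7")` finds the superseded policy and the owning team at one hop and the parent organization at two. If queries like this become deep, frequent, and latency-sensitive, that is the signal to evaluate dedicated graph infrastructure.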

What is the biggest mistake in enterprise RAG architecture?

The biggest mistake is treating RAG as an LLM feature instead of an enterprise retrieval system. The model is only the final layer; the hard work is ingestion, metadata, access control, retrieval quality, freshness, and auditability.

How should teams start improving a slow or inaccurate RAG system?

Start by logging retrieval behavior. Capture the query, filters, retrieved chunks, scores, reranked chunks, final context, citations, source freshness, permissions, and latency. Without retrieval observability, model changes are mostly guesswork.


Key Takeaways

  • RAG in production is an enterprise retrieval governance problem, not just a vector-search problem.
  • Retrieval quality usually breaks before generation quality.
  • Similarity is not authority; the system must know source freshness, approval status, and applicability.
  • Access control must be enforced before retrieval, not after generation.
  • OCR, tables, PDFs, and document structure can silently damage answer quality.
  • Hybrid search is the practical default for enterprise RAG.
  • A boring architecture — object storage, relational metadata, vector index, queue-based ingestion, hard filters, audit logs — is usually the right starting point.
  • The system must explain not only the answer, but why those sources were eligible to answer.
---

For teams building internal AI systems, the useful question is not "which vector database should we use?" It is: which sources are eligible, which permissions apply, which data is stale, which retrieval failures are happening, what evidence enters the prompt, what is logged, and what can be audited.

If you want a second opinion on a RAG or enterprise AI architecture, get in touch — or read the Designing Secure AI Agents series, which covers the same governance boundary from the agent side.


AI · Technology · June 14, 2026

Aakash Ahuja

Enterprise AI, Cybersecurity & Platform Engineering

Aakash writes about secure AI agents, microservices architecture, enterprise platforms, and production engineering. He has 20+ years of experience building and operating software systems across banking, cloud, cybersecurity, AI, and enterprise workflow automation. He is the founder of ITMTB and teaches AI, Big Data, and Reinforcement Learning at top institutes in India.