RAG in Production: What Breaks at Enterprise Scale
RAG in production does not usually fail because the language model cannot write a good answer. It fails because the retrieval layer is allowed to behave like a demo search box inside an enterprise.
Retrieval-Augmented Generation, or RAG, is the pattern where an LLM retrieves external knowledge before generating an answer. The original RAG work combined a parametric model with non-parametric memory retrieved from a dense index, and modern cloud platforms describe RAG as a way to ground model output in external data sources rather than relying only on model training data.
That definition is accurate, but incomplete for production.
At enterprise scale, RAG is not just "LLM + vector database." It becomes a governed retrieval system that must decide which data is eligible, current, authorized, precise, traceable, and safe enough to enter the model context.
The hard production question is not:
Can we retrieve similar chunks?
The hard question is:
Can we retrieve the right evidence, from the right version, for the right user, under the right access boundary, fast enough, with enough traceability to defend the answer later?
That is where most RAG systems start breaking.
Table of Contents
- Why RAG prototypes work before they fail
- The real setup: from SQLite + FAISS to enterprise constraints
- What breaks first: retrieval quality
- Why vector search alone is not enough
- Why stale data becomes a production risk
- Why access boundaries must be enforced before retrieval
- Why OCR, tables, and PDFs quietly damage RAG quality
- What architecture works for enterprise RAG
- How to evaluate production RAG
- What worked, what did not, and what to do next
- Production RAG checklist
- FAQ
- Key takeaways
Why do RAG prototypes work before they fail?
A RAG prototype usually works because the environment is forgiving.
The dataset is small. The documents are mostly clean. The users are trusted. The question set is narrow. The data does not change much. Nobody is checking whether the answer came from the latest version of the policy. Nobody is asking whether a sales user accidentally retrieved a finance-only document.
That is why the first demo feels magical.
You upload files, chunk them, create embeddings, store them in FAISS or a vector database, ask a question, retrieve top-k chunks, pass them to the LLM, and get a grounded answer.
For a demo, this is enough.
For an enterprise, this is the beginning of the problem.
A production RAG system has to survive conditions that the prototype avoided:
| Prototype assumption | Enterprise reality |
|---|---|
| All users can see all documents | Users have roles, teams, regions, tenants, and exceptions |
| Documents are static | Policies, contracts, SOPs, catalogs, tickets, and reports change |
| Text extraction is clean | PDFs, scans, tables, merged cells, images, and forms break structure |
| Top-k similarity is enough | Similarity does not prove authority, freshness, or applicability |
| One index is manageable | Multi-tenant systems need isolation, filtering, versioning, and audit trails |
| Good answer means good system | The system must prove which sources were used and why |
| Latency does not matter much | Production users expect seconds, not minutes |
RAG is easy when all data is equally trusted. Enterprise RAG starts when data is not equally visible, current, or authoritative.

What was the original setup, and why was it not enough?
One practical journey started with a custom RAG system built without LangChain.
The early stack was intentionally simple:
- SQLite for chunks, entities, relationships, facts, and events.
- FAISS for vector search.
- OpenAI for embeddings and generation.
- A custom retrieval pipeline.
- File ingestion with OCR and table extraction.
- No Kafka.
- AWS as the likely production environment.
- India-only storage as a deployment constraint.
- Tenant separation as a known future requirement.
It gave control over the pipeline. It made the data model visible. It avoided framework magic. It allowed experimentation with chunks, entities, relationships, facts, and events instead of reducing everything to anonymous text blocks.
But the production constraints were very different:
- ingest around 100 documents per day,
- batch ingest around 10 documents per minute,
- handle OCR, table extraction, and LLM structuring,
- serve around 100 QPS for retrieval,
- support hybrid search,
- isolate tenants,
- keep response latency low,
- monitor accuracy and drift,
- preserve traceability,
- avoid cross-tenant leakage,
- and reduce a full answer path that was taking roughly two minutes.
The MVP proves the retrieval pattern. Production proves the operating system around it.
What breaks first in RAG in production?
Retrieval quality breaks first.
Not generation. Not the LLM. Retrieval.
The model can only answer from the evidence it receives. If the right chunk is not retrieved, the model has three bad options:
- answer from incomplete context,
- hallucinate missing details,
- refuse or give a vague answer.
A 2026 production-style RAG fusion study makes the same point in a different way: retrieval improvements do not automatically translate into better end-to-end answers once reranking limits, context-window limits, and latency constraints enter the system.
That observation matters because many teams optimize isolated retrieval metrics and assume the answer quality will improve. In production, extra recall can get neutralized by reranking, deduplication, truncation, conflicting chunks, or latency limits.
Failure mode 1: the right content exists but is not retrieved
One real retrieval problem looked like this:
Query: "Give me list of items needed for visual inspection of machine."
The required content existed, but retrieval missed it because the system did not match the phrase "list of items" to the relevant procedural section.
This is a common enterprise problem.
Users do not ask questions using the exact language of SOPs, manuals, or engineering documents. They ask in operational language:
- "items needed"
- "things to check"
- "what do I inspect"
- "machine visual inspection"
- "pre-check list"
- "before starting"
- "required materials"
The fix is not "use a better embedding model" as the first response. The better response is to redesign retrieval:
- preserve section headings,
- capture document type and process step metadata,
- use hybrid keyword + vector search,
- add domain synonyms,
- use query rewriting carefully,
- rerank after candidate retrieval — a reranker is a second-pass model that reorders the retrieved candidates by true relevance before the top few reach the LLM,
- log missed-answer cases,
- and build a retrieval test set from real user queries.
Failure mode 2: chunks are not the unit of meaning
Many RAG systems treat chunks as if they are naturally meaningful.
They are not.
A chunk is often just an artifact of token limits.
In enterprise documents, the actual unit of meaning may be a policy clause, a procedure step, a row in a table, a field in a form, a machine inspection checklist, a contract obligation, a report finding, a support resolution, or a regulation-to-SOP mapping.
If chunking cuts across those boundaries, retrieval quality is damaged before embeddings are created.
Bad chunking creates several problems:
| Bad chunking pattern | Production impact |
|---|---|
| Heading separated from body | Retrieved text loses meaning |
| Table row separated from header | Values become ambiguous |
| Clause split across chunks | Answer misses conditions and exceptions |
| Large chunk with many topics | Similarity score becomes noisy |
| Tiny chunk with no context | Reranker cannot judge relevance |
| Duplicated boilerplate chunks | Generic text outranks specific answer |
| Old and new versions both indexed | Stale source may win retrieval |
Failure mode 3: similar does not mean authoritative
Vector search finds semantic similarity.
It does not know whether a retrieved chunk is current, approved, superseded, tenant-specific, region-specific, role-visible, legally binding, draft, archived, or contradicted by a newer document.
This is why one of the strongest rules for production RAG is:
Similarity is not authority.
A highly similar old policy can be more dangerous than no answer. A generic FAQ can outrank the exact SOP. A template can outrank a signed contract. A draft can outrank an approved version.
The retrieval layer must therefore use metadata and source status, not just embedding distance.
Why is vector search alone not enough for enterprise RAG?
Vector search is useful, but enterprise RAG needs hybrid retrieval.
Hybrid search combines keyword and full-text retrieval with vector retrieval. Microsoft's Azure AI Search documentation describes hybrid search as a single query configured for both full-text and vector queries, running them in parallel and merging results using Reciprocal Rank Fusion. It also notes that keyword search performs better for product codes, specialized jargon, dates, and people's names.
That matches production reality.
Enterprise users search for things like invoice numbers, SKU codes, policy IDs, ticket IDs, machine model numbers, clause references, employee names, locations, dates, regulation numbers, internal acronyms, and exact error messages.
Vector search can blur these. Keyword search can preserve them.
But keyword search alone fails when users ask conceptual questions in natural language.
So production retrieval usually needs a layered pipeline:
User query
↓
Identity + tenant + role resolution
↓
Query normalization
↓
Metadata filter construction
↓
Keyword retrieval Vector retrieval
↓ ↓
Candidate merge
↓
Deduplication
↓
Reranking
↓
Freshness + authority scoring
↓
Context assembly
↓
Answer generation with citations
↓
Audit log + feedback captureThis is no longer "call vector DB and pass top-k to the model."
It is a retrieval system.
Vector search vs hybrid search
| Dimension | Vector search | Hybrid search |
|---|---|---|
| Best for | Conceptual similarity | Mixed exact + semantic retrieval |
| Weakness | Can miss exact identifiers | More complex scoring and tuning |
| Handles well | "How do I inspect a machine before use?" | "Inspection checklist for MX-204 machine revision 3" |
| Production role | One signal | Default retrieval strategy |
| Risk if used alone | Similar but wrong chunks | More complexity, but better control |
Why does stale data become a production RAG risk?
Stale data is one of the most underestimated RAG failure modes.
In a prototype, documents are uploaded once. In an enterprise, source data changes continuously: policies are updated, SOPs are revised, contracts are amended, user permissions change, folders move, product catalogs change, tickets are resolved, regulations are updated, reports are regenerated, and old documents are archived.
If the RAG index does not reflect those changes, the answer may be generated from stale evidence.
This is dangerous because the model may still sound confident.
Stale RAG is worse than normal search failure in regulated or operational workflows because it can create false assurance. A user may act on an answer that cites an outdated policy, superseded SOP, or old contract clause.
The freshness problem has multiple layers
Freshness is not one timestamp. A production RAG system needs to track several freshness dimensions:
| Layer | Freshness question |
|---|---|
| Source | When did the source system change? |
| Ingestion | When did we ingest or sync it? |
| Extraction | Was text, OCR, or table extraction regenerated after source change? |
| Embedding | Were embeddings regenerated after content change? |
| Permission metadata | Were ACL or RBAC changes synced? |
| Versioning | Is this document current, archived, or superseded? |
| Cache | Is the answer or retrieval cache invalid? |
| Citation | Does the citation point to the current source? |
The same principle applies to content freshness. If the source changed but the index did not, retrieval is stale. If permissions changed but the permission index did not, access control is stale.
Minimum freshness model for production RAG
At minimum, store these fields:
document_id
source_system
source_uri
source_updated_at
ingested_at
extracted_at
embedded_at
embedding_model
embedding_version
content_hash
extraction_version
document_version
is_current
superseded_by_document_id
approval_status
valid_from
valid_until
tenant_id
access_policy_versionThis gives you the ability to answer two critical audit questions:
- Was the retrieved source current at the time of the answer?
- Was the user allowed to retrieve it at the time of the answer?
Why must access boundaries be enforced before retrieval?
Access control is the section where enterprise RAG becomes serious.
A bad RAG design does this:
Retrieve globally
↓
Put chunks into prompt
↓
Tell model not to reveal unauthorized dataThat is not access control. Once unauthorized content enters the prompt, the boundary has already failed.
The correct pattern is:
Resolve user identity
↓
Resolve tenant, role, team, region, project, document permissions
↓
Apply hard filters before retrieval
↓
Retrieve only eligible chunks
↓
Rerank only eligible chunks
↓
Generate answer only from eligible chunks
↓
Log permission contextA model cannot be trusted to "unsee" unauthorized context.
OWASP's 2025 LLM Top 10 includes vector and embedding weaknesses, warning that weaknesses in how vectors and embeddings are generated, stored, or retrieved can be exploited to inject harmful content, manipulate outputs, or access sensitive information in RAG systems.
This is why access control must sit inside retrieval, not only around the final answer. This same principle is the foundation of secure AI agent architecture — never put unauthorized data into the model's context in the first place.
The access problem is harder than tenant_id
A strict tenant_id is necessary, but insufficient. Enterprise access often depends on tenant, organization, business unit, user role, project, region, document classification, source-system permission, group membership, customer account assignment, data residency, legal hold, sensitivity label, and time-bounded access.
A production RAG index must carry permission metadata at the same granularity as retrieval. If you retrieve chunks, permissions must apply to chunks. If you retrieve rows, permissions must apply to rows. If you retrieve document sections, permissions must survive sectioning.
Microsoft's Azure AI Search documentation describes document-level access control patterns including security filters, ACL/RBAC scopes, sensitivity labels, and source ACLs, with query-time enforcement trimming results to documents the caller is authorized to read.
The general design principle is portable beyond any single vendor:
Do not retrieve first and secure later. Secure the candidate set before retrieval.
Access-control failure modes
| Failure mode | Example | Safer design |
|---|---|---|
| Cross-tenant retrieval | Tenant A retrieves Tenant B chunk | Physical or logical tenant isolation + hard filters |
| Role leakage | Sales user sees finance document | Role-aware metadata filter before search |
| Stale permissions | User removed from group but index still allows access | Permission sync + access_policy_version |
| Chunk loses parent ACL | Chunk table does not carry document permissions | Project ACL to every retrievable unit |
| Prompt-based security | "Do not reveal confidential info" | Never put unauthorized info in context |
| Tool bypass | SDK or API retrieves without policy layer | Central retrieval service with enforced policy |
| Cache leakage | Cached answer reused for another user | Cache key includes user, tenant, permission scope |
Why do OCR, tables, and PDFs quietly damage RAG quality?
Most enterprise knowledge is not clean Markdown. It is trapped in PDFs, scans, spreadsheets, reports, invoices, SOPs, forms, tables, images, email exports, contracts, intranet pages, ERP screens, and old document templates.
This creates a hidden ingestion problem. The retrieval layer can only search what the ingestion layer preserved.
If OCR corrupts the text, embeddings encode corrupted text. If table headers are lost, table rows become meaningless. If merged cells are flattened incorrectly, the answer may cite the wrong value. If a table is split across pages, the system may retrieve only half the evidence.
Azure's RAG documentation explicitly lists large documents, images and PDFs, OCR, document extraction, chunking, terminology mismatches, hybrid queries, and semantic ranking as content-preparation and relevance concerns for RAG.
That is the boring but critical truth: ingestion quality is retrieval quality.
The table problem
Tables are especially dangerous. A table row often depends on column headers, section title, page title, merged cell labels, the preceding paragraph, a footnote, units, version, and page number.
If you embed only the row text, meaning disappears.
Bad representation:
10 | 20 | Yes | No | 4.5Better representation:
Document: Machine Inspection SOP v3
Section: Visual Inspection Checklist
Table: Required inspection items before machine start
Row meaning:
- Item: Belt tension
- Required before start: Yes
- Acceptable range: 4.5 mm
- Applies to: MX-204
- Page: 7
For RAG, a table row should become a self-contained fact while preserving raw source references.
Recommended multi-layer document representation
For serious document ingestion, store multiple derived forms:
| Layer | Purpose |
|---|---|
| Raw file | Legal and source artifact |
| Raw OCR / extraction JSON | Auditability and reprocessing |
| Normalized text | Basic search |
| Semantic Markdown | Human-readable retrieval context |
| Table CSV / JSON | Structured table retrieval |
| Row-level facts | Precise QA over tables |
| Entities and relationships | Cross-document linking |
| Chunk embeddings | Semantic retrieval |
| Metadata | Filters, versioning, permissions, provenance |
What production RAG architecture works at enterprise scale?
The architecture that works is usually less glamorous than teams expect.
Do not start with an exotic graph database, multi-agent retriever, and five orchestration frameworks. Start with a boring, inspectable architecture:
Source systems → Ingestion queue → Extraction pipeline
(S3, uploads, (SQS-style queue, (OCR, PDF parsing,
intranet, retries, DLQ, table extraction,
databases, worker autoscaling) metadata, hashing)
ERP/CRM/ticketing)
↓
Storage
- S3 for raw files and extracted artifacts
- Postgres/Aurora for metadata, chunks, facts, relationships
- pgvector initially, or Qdrant when scale/isolation/filtering requires it
↓
Indexing
- full-text index - vector index
- metadata index - permission index
↓
Retrieval service
- identity resolution - tenant + RBAC/ACL filters
- keyword search - vector search
- reranking - freshness scoring
↓
Generation service
- context assembly - source citations
- answer generation - refusal / no-answer behavior
↓
Observability and audit
- query logs - retrieved + filtered chunks
- citations - latency - feedback + correction loop
Why this architecture works
It separates the system into layers with different responsibilities.
| Layer | Responsibility |
|---|---|
| Source connectors | Pull or receive data without corrupting source authority |
| Ingestion queue | Decouple document arrival from processing |
| Extraction pipeline | Convert messy source files into structured artifacts |
| Metadata store | Preserve version, tenant, permissions, lineage, status |
| Vector index | Support semantic retrieval |
| Full-text index | Support exact retrieval |
| Permission engine | Prevent unauthorized candidates from entering retrieval |
| Reranker | Improve final evidence selection |
| Generation layer | Answer from selected evidence |
| Audit layer | Explain what happened later |
pgvector itself supports vector search in Postgres, with HNSW and IVFFlat index types that involve different memory, build-time, and speed/recall tradeoffs.
The key is not to marry the application to one vector store too early. Use a VectorStore or Retriever abstraction from the beginning.
Should you use pgvector, Qdrant, FAISS, or a managed search service?
The answer depends on the stage — and the vector store is one line item in the wider enterprise LLM deployment cost model.
| Stage | Sensible choice | Why |
|---|---|---|
| Local prototype | FAISS + SQLite | Fast, cheap, simple |
| Early SaaS / MVP | Postgres + pgvector | One operational database, easier metadata joins |
| Multi-tenant production | Qdrant / OpenSearch / managed vector DB | Better filtering, isolation, operations, scaling |
| Compliance-heavy enterprise | Search with document-level ACL support | Access control becomes central |
| Very complex navigation | Agentic retrieval over existing search | Useful when one-shot retrieval is insufficient |
A vector database is not a governance layer.
When should you add a graph database?
Not early. A graph model is useful when the product genuinely needs real-time multi-hop traversal:
- "Which SOPs are affected by this regulation change?"
- "Which machines, procedures, and inspection findings connect to this failure?"
- "Which contract obligations depend on this customer region and product line?"
Move to a graph database only when query patterns demand it. Do not add a graph database because RAG sounds relational. Add it because users need graph traversal that relational queries cannot support cleanly under production latency.
How should production RAG handle compliance workflows?
Compliance RAG should not behave like generic chat over documents. In compliance workflows, wrong retrieval has higher consequences. The answer must distinguish regulation, internal policy, SOP, evidence, finding, interpretation, recommendation, and open question.
A safer compliance retrieval pattern is:
User compliance question
↓
Check structured findings first
↓
Check approved policy / regulation mappings
↓
Retrieve narrow supporting evidence
↓
Generate answer with exact citations
↓
Flag uncertainty or missing source
↓
Log answer for auditDo not always start by retrieving raw chunks. If a system has already produced compliance findings from SOPs, regulations, and enforcement documents, the first retrieval layer should check those structured findings before falling back to general RAG. This reduces the chance that a generic but similar paragraph outranks a validated finding.
Compliance RAG must support refusal
A compliance RAG system should be able to say:
- "No approved source found."
- "The retrieved source is outdated."
- "This answer depends on a policy version that has been superseded."
- "This user does not have access to the required evidence."
- "The source documents conflict."
- "This requires human review."
NIST's AI Risk Management Framework and Generative AI Profile are useful reference points for thinking about AI risk management, especially around governance and risk identification for generative AI systems.
How should you evaluate production RAG quality?
Do not evaluate only the final answer. Evaluate the retrieval pipeline separately.
A production RAG system needs at least five evaluation layers:
| Evaluation layer | What to measure |
|---|---|
| Retrieval recall | Did the correct source appear in the candidate set? |
| Ranking quality | Did the correct source survive into final top-k? |
| Context quality | Was the final context sufficient and non-conflicting? |
| Answer faithfulness | Did the answer stay grounded in the retrieved evidence? |
| Operational quality | Was the answer fast, authorized, current, and traceable? |
Build a retrieval test set from real misses
Start with real queries, not synthetic perfect questions. Include short queries, ambiguous queries, internal acronyms, wrong terminology, exact IDs, policy questions, table questions, multi-document questions, role-restricted questions, stale-document traps, and no-answer questions.
For each test query, record:
query
expected_document_id
expected_chunk_ids
acceptable_alternate_sources
required_metadata_filters
required_permission_scope
expected_answer_type
must_refuse_if_no_source
freshness_requirementThen test keyword only, vector only, hybrid, hybrid + reranking, hybrid + query rewrite, hybrid + metadata filters, and hybrid + freshness scoring. The goal is not to prove one technique is best. The goal is to know which failures each technique catches.
What should retrieval observability capture?
LLM observability is not enough. Production RAG needs retrieval observability. For every answer, you should be able to reconstruct who asked, what they asked, which tenant and access scope was used, which sources were searched, which filters were applied, which chunks were retrieved, which chunks were filtered out, which chunks were reranked, which chunks entered the prompt, which citations appeared, whether sources were current, how long each stage took, what the model answered, whether the user gave feedback, and whether the answer was later corrected.
Minimum log schema:
retrieval_event_id
answer_event_id
tenant_id
user_id
role_scope
query_text
normalized_query
filters_applied
source_systems_searched
candidate_chunk_ids
candidate_scores
reranked_chunk_ids
final_context_chunk_ids
citations_used
freshness_status
permission_policy_version
latency_keyword_ms
latency_vector_ms
latency_rerank_ms
latency_generation_ms
total_latency_ms
answer_confidence_label
feedback_score
correction_requiredThis is the difference between a demo and a system. A demo answers. A system explains how the answer happened.
What worked, what did not, and what to do next?
Several things were directionally right in the journey.
Building without LangChain gave control. Frameworks are useful, but early direct implementation made the data model visible. The system stored not only chunks but entities, relationships, facts, and events, which made it easier to reason about retrieval modes beyond top-k vector similarity. This relates closely to the difference between agent memory and state — what you persist determines what you can later retrieve and trust.
SQLite + FAISS was good for MVP. It is not enterprise-ready by itself, but it forces clarity on chunk schema, embedding lifecycle, local search, metadata gaps, and retrieval behavior. The mistake would be to confuse MVP success with production readiness.
Hybrid retrieval became the right default. The retrieval miss around "list of items" exposed a real issue: user language and document language diverge. Hybrid retrieval combines semantic meaning with exact matching.
Queue-based ingestion was the right complexity level. For this scale, Kafka was unnecessary. A queue with workers, retries, visibility timeouts, and a dead-letter queue is enough for many early production RAG systems. Heavy OCR, PDF, and LLM ingestion should run in workers, not synchronous request logic.
What did not work or would not scale:
- Pure vector retrieval. It misses exact identifiers, struggles with short operational queries, and cannot decide authority or freshness.
- Treating FAISS + SQLite as production architecture. Production needs concurrent ingestion, query isolation, metadata filtering, permission enforcement, backup and restore, index lifecycle, observability, and operational ownership.
- Embedding every keystroke. For live search UI, use prefix and autocomplete for as-you-type, run semantic retrieval only on submit or a debounce threshold, cache query embeddings, and precompute document embeddings.
- Adding a graph DB too early. Start with relational facts and relationships; move to graph only when traversal becomes a core query path.
- Filtering unauthorized data after retrieval. This is the most serious failure. If unauthorized chunks enter retrieval or the prompt, the boundary has already been crossed. The same lesson appears in agent design: tool output is not instruction, and untrusted content should never silently gain authority.
A realistic production RAG roadmap
- Make retrieval observable. Before changing models, log the query, filters, retrieved chunk IDs, scores, source versions, user and tenant, final context, citations, latency by stage, and feedback. Without this, you are debugging blind.
- Fix ingestion and chunking. Improve PDF extraction, OCR correction, table preservation, heading-to-body linking, document versioning, metadata extraction, and duplicate detection.
- Move to hybrid retrieval. Add keyword search, vector search, metadata filters, domain synonyms, reranking, and deduplication. Measure retrieval recall and final top-k quality separately.
- Add freshness and access control. Add tenant filters, role filters, source permission sync, document version tracking, content hashes, embedding versions, access-policy versioning, and cache invalidation. This is the enterprise boundary.
- Add advanced retrieval only where justified. Query rewriting, multi-query retrieval, graph expansion, agentic retrieval, and findings-first compliance retrieval come last — never before the basic retrieval system is measurable.
Production RAG checklist
Use this before calling a RAG system production-ready.
Retrieval quality
- Hybrid retrieval supports keyword and vector search.
- Metadata filters are applied before reranking.
- Reranker is evaluated separately.
- A retrieval test set exists and missed-query logs are reviewed.
- Exact identifiers, short queries, and ambiguous queries are tested.
- Duplicate chunks are detected and handled.
- Source updated, ingested, extracted, and embedded timestamps are stored.
- Embedding model and version and a content hash are stored.
- Document version is stored and superseded documents are excluded by default.
- Cache invalidation rules exist.
tenant_idexists on every retrievable unit.- Role, group, and project access is represented in metadata.
- Permission filters run before retrieval.
- Unauthorized chunks never enter prompt context.
- Cache key includes tenant, user, and access scope.
- Permission sync is monitored and cross-tenant tests exist.
- Raw source files and raw extraction JSON are preserved.
- Tables are stored as structured data and semantic Markdown is generated.
- Failed ingestion goes to a DLQ and reprocessing is possible.
- Extraction version is tracked.
- Query, retrieved-chunk, filter, and citation logs exist.
- Latency by stage is measured and a feedback loop exists.
- Corrections update evaluation cases and drift monitoring exists.
- No-answer and conflicting-source behavior is defined.
- Sensitive workflows support human review and audit export is possible.
Frequently Asked Questions About RAG in Production
What is RAG in production?
RAG in production is a retrieval-augmented generation system that is reliable enough for real users, changing data, access control, monitoring, latency constraints, and auditability. It is not just a vector database connected to an LLM.
Why does RAG fail in enterprise environments?
RAG fails in enterprises because retrieval quality, stale data, access boundaries, OCR quality, metadata, and observability are often treated as secondary problems. In reality, these are the core production problems.
Is vector search enough for production RAG?
No. Vector search is useful for semantic similarity, but production RAG usually needs hybrid search: keyword search, vector search, metadata filtering, reranking, and freshness or authority checks.
How should access control work in RAG?
Access control should be enforced before retrieval. The system should resolve the user's tenant, role, groups, and document permissions first, then retrieve only the chunks that user is allowed to see.
How do you prevent stale answers in RAG?
Track source version, content hash, ingestion time, extraction time, embedding time, document status, and supersession relationships. Exclude outdated documents by default unless the user explicitly asks for historical versions.
Should production RAG use a graph database?
Only when real-time multi-hop traversal is a core requirement. Many systems can start with relational tables for entities, relationships, facts, and events before adding graph infrastructure.
What is the biggest mistake in enterprise RAG architecture?
The biggest mistake is treating RAG as an LLM feature instead of an enterprise retrieval system. The model is only the final layer; the hard work is ingestion, metadata, access control, retrieval quality, freshness, and auditability.
How should teams start improving a slow or inaccurate RAG system?
Start by logging retrieval behavior. Capture the query, filters, retrieved chunks, scores, reranked chunks, final context, citations, source freshness, permissions, and latency. Without retrieval observability, model changes are mostly guesswork.
Key Takeaways
- RAG in production is an enterprise retrieval governance problem, not just a vector-search problem.
- Retrieval quality usually breaks before generation quality.
- Similarity is not authority; the system must know source freshness, approval status, and applicability.
- Access control must be enforced before retrieval, not after generation.
- OCR, tables, PDFs, and document structure can silently damage answer quality.
- Hybrid search is the practical default for enterprise RAG.
- A boring architecture — object storage, relational metadata, vector index, queue-based ingestion, hard filters, audit logs — is usually the right starting point.
- The system must explain not only the answer, but why those sources were eligible to answer.
For teams building internal AI systems, the useful review is not "which vector database should we use?" The useful question is: what sources are eligible, what permissions apply, what data is stale, what retrieval failures are happening, what evidence enters the prompt, what is logged, and what can be audited.
If you want a second opinion on a RAG or enterprise AI architecture, get in touch — or read the Designing Secure AI Agents series, which covers the same governance boundary from the agent side.
References
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — original RAG research.
- Retrieve data and generate AI responses with Amazon Bedrock Knowledge Bases — AWS RAG, retrieval, and citations.
- RAG and Generative AI — Azure AI Search — content preparation, chunking, OCR, hybrid queries, semantic ranking.
- Hybrid Search Overview — Azure AI Search — full-text + vector search and Reciprocal Rank Fusion.
- Document-Level Access Control — Azure AI Search — query-time access trimming and permission sync.
- OWASP LLM08:2025 Vector and Embedding Weaknesses.
- NIST AI Risk Management Framework and Generative AI Profile.
- pgvector — IVFFlat and HNSW index tradeoffs.
- Scaling Retrieval Augmented Generation with RAG Fusion — production-style study on recall, reranking, context limits, and latency.
