The RAG pitch is compelling. Connect your internal documents to an LLM, and the assistant automatically draws on contracts, procedures, meeting notes, and client records to answer your team's questions. In demos, it almost always works. The model retrieves the right passages, synthesises fluently, and the audience is sold.
Six weeks later, the mood has shifted. The answers are not wrong exactly: they are approximate. The system returns documents that are close but not quite right. It answers on an adjacent topic instead of the exact one. Users lose confidence gradually, fall back on manual search, and the project stalls between steering committee reviews. No dramatic failure. Just a slow erosion of trust.
The root cause is almost never the language model. It is not the choice between GPT-4o and Claude 3.7, nor the configuration of the system prompt. It is the quality of what you are asking it to retrieve: how documents are chunked, whether the embedding model fits your domain and language, whether any evaluation mechanism exists at all. A RAG project that holds up in production is first and foremost a data quality project, with an LLM at the output. That distinction changes everything about how you should design it.
The Wrong Question at the Start
When a team starts a RAG project, the first decision is almost always about tooling: LangChain or LlamaIndex? Pinecone or Weaviate? GPT-4o or Claude? These choices matter, but they come too early. They assume the bottleneck is in the plumbing, not in what you are pumping through it.
The right first question is: what is the state of your source documents? Are they structured consistently? Do they contain duplicates, or outdated versions sitting alongside current ones? Were they extracted from PDFs with poor OCR that corrupts characters at every column break? Are the relevant sections buried inside 200-page documents packed with regulatory appendices that add noise without value?
This upstream diagnostic is rarely done because it is not glamorous. It does not require a GPU and does not make for an impressive demo. But it determines the quality of everything downstream. A vector index built on poor-quality documents will return poor-quality retrievals, regardless of how sophisticated the rest of the pipeline is. The "garbage in, garbage out" principle applies to RAG with particular force: approximations compound at every stage, from chunking through embedding, all the way to the final answer the user reads.
Chunking: Where Everything Quietly Goes Wrong
Chunking is the operation that splits a document into fragments before indexation. It is the most underestimated step in the RAG pipeline, and often the first source of silent failure.
The Problem with Fixed-Size Splitting
The most common strategy is fixed-size chunking: split the text every 512 tokens, with a 50-token overlap to limit the damage when a cut lands mid-sentence. It is simple, fast, and produces poor results on real business documents.
A standard operating procedure document, for example, alternates section headings, action lists, compliance tables, and footnotes. Cutting at 512 tokens without regard for semantic structure produces fragments that mix the end of one step and the beginning of the next, or that separate a table from its introductory context. The fragment is technically present in the index, but practically useless for answering a specific question.
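To make the mechanism concrete, here is a minimal sketch of fixed-size chunking in Python, assuming the tiktoken tokenizer; the window and overlap values mirror those above, and nothing in the loop knows about headings, tables, or sentences.

```python
# Minimal fixed-size chunking: split every 512 tokens with a 50-token overlap.
# Assumes the tiktoken package; any tokenizer with encode/decode would do.
import tiktoken

def fixed_size_chunks(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        # The cut point ignores headings, tables and sentences entirely:
        # a procedure step can end in one chunk and resume in the next.
        chunks.append(enc.decode(window))
    return chunks
```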
A comparative study published in PMC in 2025 on clinical decision support illustrates the gap starkly: on the same corpus, adaptive chunking aligned to logical topic boundaries reached 87% accuracy, against 13% for fixed-size splitting. The medical domain is inherently demanding, but the mechanism is identical for legal documents, internal procedures, or technical specifications.
Moving to Structure-Aware and Semantic Chunking
The strategy that performs best on enterprise documents is structure-aware chunking: treat headings and subheadings as natural boundaries, never split a table in two, and use semantic content to identify thematic breaks within a section.
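As a sketch of the structure-aware variant, assuming documents have already been converted to Markdown-style headings (the helpers below are illustrative, not a library API): headings become hard boundaries, and an oversized section is only split further on paragraph breaks.

```python
# Structure-aware chunking sketch: headings are hard boundaries, so a fragment
# never straddles two sections. Real pipelines add table and list handling
# plus semantic splitting within very long sections.
import re

HEADING = re.compile(r"^#{1,4}\s", re.MULTILINE)

def split_on_headings(text: str) -> list[str]:
    heads = [m.start() for m in HEADING.finditer(text)]
    starts = heads if heads and heads[0] == 0 else [0] + heads
    bounds = starts + [len(text)]
    sections = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]

def chunk_section(section: str, max_chars: int = 4000) -> list[str]:
    # Split an oversized section on paragraph breaks, never mid-paragraph.
    if len(section) <= max_chars:
        return [section]
    chunks, current = [], ""
    for paragraph in section.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

# document_text is assumed to be the Markdown of one source document.
chunks = [c for s in split_on_headings(document_text) for c in chunk_section(s)]
```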
The NVIDIA guide on chunking strategies offers a useful rule: for factual queries (a date, an amount, a name), fragments of 256 to 512 tokens are sufficient. For analytical queries (a comparison, a process explanation, a recommendation), fragments of 1,024 tokens or more are needed so the LLM has enough context to synthesise a coherent answer.
The practical rule we apply in every engagement: each chunk should be able to answer a question on its own. If the retrieved fragment requires knowing what comes before or after to make sense, the chunking is wrong.
Embeddings: Choose for Your Documents, Not for the Leaderboard
Once the fragments are created, they need to be converted to vectors. The embedding model determines how close "travel expense reimbursement policy" and "how do I claim back my business mileage?" are in vector space. The better the model fits your domain and language, the more relevant your retrievals will be.
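As an illustration of that first retrieval pass, here is a minimal bi-encoder sketch with the sentence-transformers package; the model name is only an example, not a recommendation.

```python
# Bi-encoder sketch: encode query and fragments separately, rank by cosine.
# Assumes the sentence-transformers package; the model name is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-large")

query = "how do I claim back my business mileage?"
fragments = [
    "Travel expense reimbursement policy: mileage is reimbursed at ...",
    "Office supplies ordering procedure ...",
]

q_emb = model.encode(query, normalize_embeddings=True)
f_emb = model.encode(fragments, normalize_embeddings=True)
scores = util.cos_sim(q_emb, f_emb)  # higher score = closer in vector space
print(scores)
```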
The General Benchmark Trap
The standard reflex is to pick the model topping the MTEB leaderboard (Massive Text Embedding Benchmark), the sector reference across 56 tasks. It is a reasonable starting point, but a misleading oracle. The MTEB score is an average across classification, clustering, semantic similarity, and retrieval tasks. A model that dominates classification may underperform on retrieval, which is precisely the task that matters for RAG.
In 2026, strong candidates for multilingual retrieval include Alibaba's Qwen3-Embedding-8B and Cohere embed-v4, designed for long-context enterprise environments. For a deployment in French, German, or Spanish with mixed-language documents, the overall leaderboard score is not enough to decide.
The Mandatory Test on Your Own Data
There is no shortcut: before committing to an embedding model, build a representative test set (100 to 300 question/expected-document pairs) and measure recall@5 and MRR (Mean Reciprocal Rank) for each candidate. A model ranked twelfth on MTEB may easily outperform the number-one model on your specific corpus, especially if your documents are technical, domain-specific, or written in formal register.
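A minimal sketch of that measurement, assuming you already have a retrieve function wrapping your index; both the function and the test-set layout are illustrative.

```python
# Evaluate an embedding model on your own corpus: recall@5 and MRR (here MRR@k,
# computed over the top-k results only).
# `retrieve(question, k)` is a hypothetical function returning ranked doc ids.
def evaluate_retrieval(test_set: list[tuple[str, str]], retrieve, k: int = 5):
    hits, reciprocal_ranks = 0, []
    for question, expected_doc_id in test_set:
        ranked_ids = retrieve(question, k=k)
        if expected_doc_id in ranked_ids:
            hits += 1
            reciprocal_ranks.append(1.0 / (ranked_ids.index(expected_doc_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    recall_at_k = hits / len(test_set)
    mrr = sum(reciprocal_ranks) / len(test_set)
    return recall_at_k, mrr
```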
This step takes a day. It saves weeks of debugging once you are in production, and often avoids a costly model migration six months down the line.
Reranking: The Fastest Lever for Precision
Even with a good embedding model and solid chunking, the first pass of vector retrieval is imperfect. Cosine similarity search returns the closest fragments in embedding space, but vector proximity is not the same as semantic relevance for a given query.
Reranking adds a second pass: retrieve 50 to 100 candidates with fast vector search, then use a cross-encoder to re-score them by examining each (query, fragment) pair jointly, with a much finer reading of how the two relate.
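A minimal sketch of that second pass with the CrossEncoder class from sentence-transformers; the model name is one commonly used example, and the candidate list is assumed to come from the first-pass vector search.

```python
# Rerank first-pass candidates with a cross-encoder (second pass).
# Assumes sentence-transformers; the model name is one common example.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # The cross-encoder scores each (query, fragment) pair jointly,
    # which is slower than vector search but far more precise.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```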
Bi-Encoders and Cross-Encoders: Two Distinct Roles
A bi-encoder (the standard embedding model) encodes the query and the document separately, then computes a similarity score. It is fast and scales to millions of documents. A cross-encoder examines them together, capturing semantic relationships that independent vector comparison systematically misses.
The Pinecone guide on rerankers documents NDCG@10 improvements of 15 to 30% on representative benchmarks, at a latency cost of roughly 100 to 150 milliseconds for a batch of 50 candidates. On complex business queries, that trade-off is an easy one to accept.
When to prioritise adding a reranker: on corpora of more than 10,000 documents, on analytical queries (not just factual lookups), and whenever users report retrievals that are "close but not quite right." It is usually the modification with the best impact-to-effort ratio in an existing RAG pipeline.
Continuous Evaluation: The Only Way to Avoid Drift
The main reason RAG projects stall without ever truly failing is the absence of measurement. The project works at launch, early feedback is positive, and no one measures what happens next. Quality degrades gradually: source documents age, new query patterns emerge, and the system starts missing the mark without anyone being able to quantify it.
The Four RAGAS Metrics
The RAGAS framework (Retrieval Augmented Generation Assessment) provides four complementary metrics for evaluating a RAG pipeline without expensive human annotations:
Faithfulness. Is the answer entirely grounded in the retrieved fragments? A faithful answer invents nothing absent from the context.
Answer Relevancy. Does the answer address the question asked, without redundant or off-topic content?
Context Precision. Are the relevant fragments ranked at the top of the retrieved results?
Context Recall. Do the retrieved fragments contain all the information needed to answer the question?
A healthy RAG pipeline scores well on all four dimensions. A pipeline that retrieves well but hallucinates has good context recall and poor faithfulness. A pipeline that retrieves the wrong documents has poor context precision, and everything else suffers as a result.
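A minimal sketch of running those four metrics with the ragas package; import paths and dataset field names vary between ragas versions, so treat this as the shape of the evaluation rather than a drop-in snippet.

```python
# Run the four RAGAS metrics on a small evaluation set.
# API shown is the classic ragas interface; names differ across versions.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy, context_precision, context_recall,
)

eval_data = Dataset.from_dict({
    "question":     ["What is the mileage reimbursement rate?"],
    "answer":       ["The rate is defined in the travel expense policy ..."],
    "contexts":     [["Travel expense reimbursement policy: mileage ..."]],
    "ground_truth": ["Mileage is reimbursed according to the travel policy ..."],
})

scores = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)  # one aggregate score per metric
```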
The Golden Dataset: Your Most Valuable Asset
RAGAS metrics are only as useful as the evaluation set you run them against. Build a proprietary golden dataset: 100 to 300 questions representative of your users' real queries, with reference answers and expected source documents.
This dataset is the most valuable asset of your RAG project. It lets you test the impact of every change (new embedding model, revised chunking strategy, new document sources) before deploying it. And it lets you measure drift over time: if faithfulness scores drop between January and March, something has changed in your documents or your query patterns.
In practice, we integrate evaluation into the CI/CD pipeline: every modification reruns against the golden dataset, and a score below threshold blocks deployment. It is no different from unit tests on a standard software project, and it is the only way to know whether a change is actually an improvement.
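In code, the gate can be as simple as a test that fails the build when a metric drops below its threshold; the thresholds and the run_ragas_on_golden_dataset helper below are illustrative.

```python
# CI gate sketch: block deployment when evaluation scores fall below threshold.
# `run_ragas_on_golden_dataset` is a hypothetical helper in your own codebase
# that reruns the pipeline on the golden dataset and returns metric scores.
from eval_pipeline import run_ragas_on_golden_dataset  # hypothetical module

THRESHOLDS = {"faithfulness": 0.85, "context_recall": 0.80}

def test_rag_quality_gate():
    scores = run_ragas_on_golden_dataset()  # e.g. {"faithfulness": 0.91, ...}
    for metric, minimum in THRESHOLDS.items():
        assert scores[metric] >= minimum, (
            f"{metric} = {scores[metric]:.2f} is below the {minimum} threshold"
        )
```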
Document Governance: The Work Nobody Wants to Do
Behind every chunking, embedding, and evaluation problem, there is often a deeper issue: nobody really knows what is in the document base that was indexed.
The Shared Drive Syndrome
Most enterprise RAG projects start with "all the documents": the SharePoint, the Google Drive, the document management system. In practice, that means signed contracts alongside unfinished drafts, 2019 procedures that have never been updated, duplicate files with slightly different names, meeting notes without context, and PDFs from poor-quality scans whose OCR output is barely readable.
A vector index built on that corpus will return all of these artefacts indiscriminately. The system can answer a compliance question by citing a procedure that has been obsolete for three years, with nothing in the pipeline to flag it. A paper published on arXiv in November 2025, RAG-Driven Data Quality Governance for Enterprise ERP Systems, documents precisely how RAG pipelines deployed on ERP systems degrade in the absence of active document governance: duplicated data, missing metadata, no expiry (TTL) on documents, and no way to trace an answer back to its source.
Four Practical Rules
Owner per collection. Every document collection has a designated owner, responsible for updates and archiving obsolete versions. Without an owner, no one removes expired documents.
Expiry dates on time-sensitive documents. Procedures, contracts, and internal policies have a validity date. Store it as metadata and filter expired documents upstream of indexation (a sketch follows this list), or tag them so the LLM can flag them explicitly in its answers.
Inherited access control. RAG must not expose through its answers documents a user would not be allowed to consult directly. The indexation scope for a given user must reflect their actual document rights. This is an architectural concern to address from day one, not a retroactive patch.
OCR quality check before indexation. A pre-processing pipeline verifies the readability of documents extracted from PDFs before they enter the index. A document with 20% corrupted tokens degrades the quality of every neighbouring document in the collection.
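A minimal sketch of that pre-indexing filter, combining the expiry rule with the OCR readability check; the metadata field names and the 0.8 readability threshold are assumptions to adapt to your own document model.

```python
# Pre-indexing filter sketch: drop expired documents and unreadable OCR output
# before anything reaches the vector index. Field names are assumptions.
from datetime import date

def readable_ratio(text: str) -> float:
    # Crude OCR-quality proxy: share of tokens containing plain word characters.
    tokens = text.split()
    if not tokens:
        return 0.0
    clean = sum(1 for t in tokens if any(c.isalnum() for c in t))
    return clean / len(tokens)

def should_index(doc: dict, min_readability: float = 0.8) -> bool:
    expires = doc.get("valid_until")            # a date, or None if evergreen
    if expires is not None and expires < date.today():
        return False                            # expired policy or contract
    if readable_ratio(doc["text"]) < min_readability:
        return False                            # likely a bad scan / broken OCR
    return True

# all_documents is assumed to be the raw corpus loaded upstream.
documents_to_index = [d for d in all_documents if should_index(d)]
```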
Document governance is not an AI topic. It is an information management topic, one most organisations have left unowned. RAG makes it visible because it amplifies source quality, in both directions.
What We Do in Practice
When a client comes to us with a stalling RAG project, we always start with the same four-point audit: source quality (OCR, duplicates, outdated documents), current chunking strategy (size, overlap, structure awareness), presence or absence of a reranker, and whether a golden dataset and tracking metrics exist.
On the projects we take over, most answer-quality problems are resolved by a combination of two fixes: revising the chunking strategy and adding a cross-encoder reranker. It is not spectacular work, but it is what separates an assistant that misses the point from one that teams actually use every day.
The remaining problems are documentary ones: an oversized, ungoverned corpus, outdated documents, no ownership. Fixing them is the longest work, but it is also the only work that guarantees durable quality over time. A well-built RAG system in 2026 is as much a document management infrastructure as it is an AI project.
Want to look at your specific case together? Book a slot and we will spend 30 minutes auditing your document sources and identifying the levers that will unblock your RAG project.