
RAG: what it is, when it works, when it doesn't

Language models are probability engines. They generate plausible continuations based on patterns, which makes them brilliant for creative work and useless for specific facts. They know a lot about everything and very little about something specific.

Retrieval Augmented Generation exists to bridge that gap. Instead of asking the model to generate an answer from its training, you give it access to real data and ask it to reason about that instead. Your knowledge, your documents, your facts, fed into the context at the moment it matters.

The concept is elegant. The implementation is where it gets interesting.

Why the model doesn't just know

Before getting into RAG itself, it's worth understanding why you need it in the first place.

Language models know an astonishing amount about an astonishing range of things. They've been trained on vast swathes of the internet, books, papers, code. Ask about history, science, literature, programming, and they'll give you something useful.

But here's the thing: they know a lot about everything and very little about something specific.

A model might know what a returns policy generally looks like. It doesn't know your returns policy. It might know how customer complaints typically get handled. It doesn't know your process. The training data gave it patterns, not your particulars.

This is fine for general questions. It's useless for specific ones. When someone asks about your product, your policy, your pricing, they don't want a plausible-sounding approximation based on how these things usually work. They want the actual answer.

The model can't give that from training. It doesn't have the information. It never saw your documentation, your systems, your specific terms. And even if it did, training data has a cutoff. Things change. The model doesn't know what changed yesterday.

This is the gap RAG fills. It gives the model access to your specific, current, accurate information at the moment it needs it.

The basics

The idea is straightforward. When someone asks a question, you don't just pass it to the language model and hope for the best. First, you search your knowledge base for relevant information. Then you include that information in the context alongside the question. The model generates an answer based on what you've retrieved, grounded in your actual data rather than its training.

This is how you stop a chatbot inventing product features that don't exist. How you make sure a support agent quotes the real policy. How you build systems that know things the base model was never trained on.
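
In skeleton form, that loop is just retrieve, then generate. Here's a minimal sketch of the shape of it; the word-count "embedding", the knowledge base contents and the prompt wording are toy stand-ins for illustration, not a real implementation.

```python
# A minimal retrieve-then-generate loop. The word-count "embedding" is a
# toy stand-in so the example runs without external services; a real
# system would call an embedding model and a language model instead.
from collections import Counter
from math import sqrt

KNOWLEDGE_BASE = [
    "Items can be returned within 30 days with proof of purchase.",
    "Standard delivery takes three to five working days.",
]

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts."""
    return Counter(text.lower().split())

def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two toy embeddings."""
    dot = sum(a[word] * b[word] for word in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k knowledge-base entries most similar to the question."""
    query = embed(question)
    ranked = sorted(KNOWLEDGE_BASE, key=lambda doc: similarity(query, embed(doc)), reverse=True)
    return ranked[:k]

def build_prompt(question: str) -> str:
    """Assemble the retrieved context and the question into a prompt."""
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# The prompt would then be sent to your language model of choice.
print(build_prompt("How long do I have to return something?"))
```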


The retrieval problem

RAG has two parts: retrieval and generation. Most people focus on the generation, because that's the bit with the language model in it. But the retrieval is where most RAG systems succeed or fail.

If you retrieve the wrong information, it doesn't matter how good your model is. It'll reason beautifully about irrelevant context and give you a confident, well-structured, useless answer. Garbage in, garbage out, but with better grammar.

Getting retrieval right means caring about things that sound boring until they bite you. Chunking strategy: how do you break your documents into pieces? Too big and you dilute the relevant bits with noise. Too small and you lose context. Embedding choices: which model turns your text into vectors? They're not all the same, and the differences matter. What you index and how: do you index raw text, summaries, questions the text might answer? Each approach has tradeoffs.
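
To make the chunking tradeoff concrete, here's a sketch of one common approach, fixed-size windows with overlap. The sizes are illustrative, and many systems split on sentence or section boundaries instead; the shape of the tradeoff is the same.

```python
# One common chunking strategy: fixed-size character windows with overlap.
# The numbers are illustrative; the right values depend on your documents.
def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping windows of roughly `size` characters.

    Larger windows keep more surrounding context but dilute the relevant
    bits with noise; smaller ones are more precise but lose that context.
    The overlap stops a sentence being cut cleanly in half at a boundary.
    """
    assert 0 <= overlap < size, "overlap must be smaller than the chunk size"
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]
```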

And then there's the matching problem. The embeddings you use to index your content need to match the embeddings you use to search it. Sounds obvious, but getting this wrong is surprisingly easy. Different embedding models, different vector lengths, different assumptions about what similarity means. We've seen systems fail silently because the indexing and query embeddings didn't match, returning results that looked plausible but weren't actually the most relevant.
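
A cheap safeguard is to record which embedding model and vector length the index was built with, and refuse to search when a query embedding doesn't match. The sketch below assumes nothing about any particular vector store; the model name and dimension are placeholders.

```python
# Store the embedding model name and vector length alongside the index,
# and refuse to search when a query doesn't match. The model name and
# dimension below are placeholders, not recommendations.
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexMetadata:
    embedding_model: str
    dimensions: int

INDEX_META = IndexMetadata(embedding_model="example-embedder-v2", dimensions=768)

def check_query_embedding(model_name: str, vector: list[float]) -> None:
    """Raise before searching if the query embedding can't match the index."""
    if model_name != INDEX_META.embedding_model:
        raise ValueError(
            f"Query embedded with {model_name!r}, but the index was built "
            f"with {INDEX_META.embedding_model!r}."
        )
    if len(vector) != INDEX_META.dimensions:
        raise ValueError(
            f"Query vector has {len(vector)} dimensions; the index expects "
            f"{INDEX_META.dimensions}."
        )
```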

The boundary problem

RAG works brilliantly when the answer is in your knowledge base. But what happens when it isn't?

This is where things get messy. You can extend the system to search the web when local retrieval fails, but now you're mixing curated, trusted information with whatever the internet serves up. Your carefully structured knowledge base gets polluted with content you haven't verified. The line between "things we know are true" and "things we found online" starts to blur.

There's no clean solution here. You can be strict and only answer from your own data, which means saying "I don't know" more often than users might like. You can be permissive and go to the web, which means accepting the quality tradeoffs that come with it. You can try to be clever and flag which answers come from which sources, but that adds complexity and most users won't read the fine print anyway.
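
However you lean, the decision is easier to reason about when it lives in one explicit place. The sketch below gates on a retrieval confidence threshold, labels anything fetched from the web as unverified, and otherwise returns nothing so the caller can say "I don't know". The threshold value and the web search hook are assumptions, not recommendations.

```python
# An explicit boundary policy: answer from the knowledge base when retrieval
# is confident, optionally fall back to the web with a clear label, and
# otherwise return nothing so the caller can say "I don't know".
# The threshold value and the web_search hook are assumptions.
from typing import Callable, NamedTuple, Optional

class Passage(NamedTuple):
    text: str
    score: float   # similarity score from local retrieval (0.0 for web results)
    source: str    # "knowledge_base" or "web"

CONFIDENCE_THRESHOLD = 0.75  # illustrative; tune against real queries

def choose_context(
    local_hits: list[Passage],
    allow_web: bool = False,
    web_search: Optional[Callable[[], list[str]]] = None,
) -> Optional[list[Passage]]:
    """Decide what context, if any, the model is allowed to answer from."""
    confident = [hit for hit in local_hits if hit.score >= CONFIDENCE_THRESHOLD]
    if confident:
        return confident
    if allow_web and web_search is not None:
        # Everything from here on is unverified; keep the label attached
        # so the answer can say where its information came from.
        return [Passage(text=result, score=0.0, source="web") for result in web_search()]
    return None  # strict mode: the honest answer is "I don't know"
```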

The right answer depends on the use case. But you need to make a deliberate choice, not just let the system figure it out.

The underuse problem

For all the hype about RAG, we don't see it used as much as it could be. People reach for fine-tuning when retrieval would work better. They stuff everything into ever-larger context windows instead of being selective about what the model needs to see. They build complex agentic systems when a well-structured knowledge base and good retrieval would be simpler and more reliable.

RAG shines when you need to answer specific questions from a defined body of knowledge. When someone asks "what's your refund policy?" and you need the actual policy, not a plausible-sounding approximation. When context windows are a constraint and you can't just dump everything in and hope for the best.

It's less useful when questions are open-ended, when the knowledge base is sparse, or when the real problem is that you don't have good information in the first place. RAG can't fix bad data. It can only retrieve it more efficiently.

The use case question

The pattern we see is that RAG works brilliantly when people are clear about what they're trying to do. When the use case is defined, when the questions are predictable, when the knowledge base is well-structured. In those situations, it's transformative. Fast, accurate, grounded answers that would be impossible otherwise.

It struggles when the use case is vague. "Make our AI smarter" isn't a RAG problem. "Answer customer questions about our product catalogue" is. The more specific you can be about what questions you're trying to answer and what information you need to answer them, the better RAG will work.

This is true of most AI implementations, but it's especially true here. RAG isn't magic. It's plumbing. Good plumbing makes everything work smoothly. But you need to know where the water's supposed to go.

The craft

We've built a lot of RAG systems at this point, and the thing that strikes us is how much craft is involved. The concept fits on a napkin. The implementation is a series of judgement calls about chunking, embedding, indexing, retrieval, ranking, context assembly, prompt design. Each decision affects the others. Getting it right takes iteration, testing, and a willingness to admit when something isn't working.

The good news is that when it works, it really works. The bad news is that "just add RAG" is never as simple as it sounds.
