AI Explained

What is RAG (retrieval-augmented generation), and how does it work?

RAG, or retrieval-augmented generation, is the trick that lets AI models answer questions about your own documents. Here is how it actually works, and where it falls down.

Anyone who has used ChatGPT, Claude, or Gemini for real work will have hit the same wall at some point. You ask the model about something specific. Last week’s news. A clause in a contract. The latest version of an internal product. And the model either makes something up that sounds plausible, or politely tells you its training data only goes up to a certain date. RAG is the reason that wall is starting to come down.

RAG stands for retrieval-augmented generation. The name is fairly literal once you unpack it. A normal large language model generates text by predicting the next word based on patterns it learned during training. It does not look anything up in the moment. If the answer is in its training data, you might get it. If it is not, you get whatever the model thinks is the most plausible-sounding sentence to follow your question. The model is essentially writing from memory, and that memory was frozen at the point training stopped.

RAG adds a step before the model writes its answer. When you ask a question, the system first goes and retrieves relevant documents from some external source. That source could be your company’s internal wiki, a folder of PDFs, a website, a customer support archive, or a curated knowledge base. The retrieval step finds the most relevant passages. Then the model is given those passages alongside your original question, and it writes its answer using both.
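To make that concrete, here is a deliberately small sketch of those steps in Python. It is a toy rather than a real system: simple word overlap stands in for the proper retrieval machinery described below, and the model call at the end is left as a comment because it depends on whichever provider you use.

```python
# A toy version of the RAG loop, runnable as-is. Word overlap stands in
# for real retrieval, and the model call is only sketched in a comment.

def tokens(text: str) -> set[str]:
    # Crude tokeniser: lowercase, drop basic punctuation, split on spaces.
    return set(text.lower().replace(".", " ").replace("?", " ").split())

def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    # Rank documents by how many words they share with the question.
    q = tokens(question)
    return sorted(documents, key=lambda d: len(q & tokens(d)), reverse=True)[:top_k]

documents = [
    "Expenses under 50 euros do not need a receipt.",
    "The office is closed on public holidays.",
    "Travel must be booked through the internal portal.",
]

question = "Do I need a receipt for small expenses?"
passages = retrieve(question, documents)
print(passages)

# The final step hands the passages and the question to the model together,
# e.g. llm.generate(context=passages, question=question)  # hypothetical call
```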

The technique was introduced in a 2020 paper from Facebook AI Research called “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” The original idea was to combine the fluent language generation of a seq2seq model with the factual grounding of a retrieval system over Wikipedia. The motivation was straightforward. Language models were getting good at writing, but they were unreliable when you asked them to be precise about a specific fact. Letting them look something up first, the authors argued, fixed a lot of that.

A useful analogy. Think of a normal language model as a friend who has read enormously widely but cannot check anything in the moment. They are confident, articulate, and sometimes wrong. RAG turns that friend into one who pauses, walks over to a filing cabinet, finds the relevant papers, reads them, and then comes back to answer you. Slower, more cautious, but the answer is grounded in something concrete.

The retrieval step is usually done with what is called a vector database. Each document in your knowledge base is split into short passages, or chunks, and each chunk is converted into a long list of numbers, called an embedding, that captures its meaning. Your question is converted into the same kind of list. The system finds the chunks whose embeddings sit close to your question’s embedding in that mathematical space. Closeness here is a stand-in for relevance. So if you ask about expense policy, the system pulls back the right part of the expense policy document, even if your exact words never appear in it, because the meaning is what is being matched.
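Here is roughly what that looks like in code. The embed function below is a toy that hashes three-letter fragments into a vector, there only so the example runs on its own; a real system would call an embedding model instead. The part worth reading is the last two lines, where relevance becomes simple arithmetic.

```python
import numpy as np

# Toy embedding: hash character trigrams into a fixed-size vector.
# A real system would call an embedding model here; this stand-in
# exists only so the example runs on its own.
def embed(text: str, dim: int = 256) -> np.ndarray:
    vec = np.zeros(dim)
    t = text.lower()
    for i in range(len(t) - 2):
        vec[hash(t[i:i + 3]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

docs = [
    "Employees may claim travel expenses with a receipt.",
    "The cafeteria serves lunch from noon until two.",
]

doc_vecs = np.stack([embed(d) for d in docs])
query_vec = embed("what is the expense policy for travel")

# Cosine similarity: the vectors are unit length, so a dot product is enough.
scores = doc_vecs @ query_vec
print(docs[int(np.argmax(scores))])  # the closest chunk by meaning
```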

Once the relevant chunks are pulled out, the system bundles them together with your original question into a prompt, and hands the whole thing to the language model. The model then writes its answer with the retrieved passages right in front of it.
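That bundling step really is as plain as it sounds. The template wording below is illustrative, and every product phrases it differently, but the shape is the same everywhere: retrieved text on top, question underneath.

```python
# What the model actually receives: the retrieved chunks pasted in above
# the user's question. The template wording here is illustrative only.
retrieved_chunks = [
    "Expenses under 50 euros do not need a receipt.",
    "All claims must be filed within 30 days.",
]
question = "Do I need a receipt for a 20 euro taxi ride?"

context = "\n\n".join(retrieved_chunks)
prompt = (
    "Answer using only the context below. "
    "If the answer is not in the context, say so.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)
print(prompt)
```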

A lot of people assume RAG means the model has somehow been trained on the company’s documents. That is not what is happening, and the difference matters. The model itself is unchanged. Its weights, its training data, its general knowledge, all of that stays the same. What changes is the prompt. RAG is essentially a very clever way of stuffing the right pages into the model’s reading material at the moment it answers.

This distinction has real consequences. It means RAG is fast to set up because you are not retraining anything. You can add new documents tomorrow morning and the system will use them this afternoon. It also means the model is not learning your information in any persistent sense. Pull the documents away and the model knows nothing about them again. That is actually a feature for sensitive data, because nothing about your private documents leaks into the model’s permanent knowledge.
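You can see this in code. Continuing the toy embedding sketch from earlier, adding a document is a couple of lines of index bookkeeping, and the model is never involved.

```python
# Continuing the embedding sketch from earlier: adding a document is an
# index update, not a training run. The model itself is never touched.
new_doc = "Remote work requests must be approved by a line manager."
docs.append(new_doc)
doc_vecs = np.vstack([doc_vecs, embed(new_doc)])

# The very next question can retrieve it. Remove the row, and it is
# forgotten just as quickly.
```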

The same distinction means the quality of your answers depends almost entirely on the quality of your retrieval. If the retrieval step pulls back the wrong passages, the model will confidently use those wrong passages to write the wrong answer. RAG does not make a language model smarter. It changes what is on the desk in front of it. Garbage in still produces garbage out, just dressed up more neatly.

The other common misunderstanding is that RAG eliminates hallucination, the well-known habit language models have of inventing facts that sound right. It reduces hallucination. It does not eliminate it. If the retrieved passage genuinely contains the answer, RAG works well. If it does not, the model can still drift into making things up, especially if your question pushes it to combine ideas across passages or to reason about something the passages only hint at.

If you are using AI at work and getting frustrated that it does not know your company’s policies, your customer history, or your internal product names, RAG is almost certainly the technique that will solve that. Most enterprise AI tools you can buy today, including Microsoft Copilot for business, the document chat features in Notion, and search assistants like Glean, are RAG systems under the bonnet. The model on top might be GPT, Claude, or Gemini, but the trick that lets it answer about your stuff is retrieval.

What you should look at when evaluating one of these tools is not the model. The model is rarely the differentiator. What matters is the retrieval. How is the system chunking your documents into searchable pieces? How is it deciding what is relevant? What happens when the answer is spread across several files? How does it handle permissions, so people only retrieve what they are allowed to see? Those are the questions that decide whether the tool actually works in practice.

If you are building something yourself, the lesson is the same. Spend your time on the retrieval pipeline, not on tweaking the model. Most disappointing RAG deployments do not fail because the language model is weak. They fail because the retrieval step is pulling back irrelevant passages, and the model is doing its best with whatever it has been handed.
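Chunking is a good example of where that effort goes. The sketch below is the most common starting point, fixed-size windows with a little overlap; the numbers are illustrative, and production pipelines often split on paragraphs or headings instead.

```python
# The simplest common chunking scheme: fixed-size character windows with
# overlap, so a sentence that straddles a boundary still appears whole
# in at least one chunk. The sizes are illustrative, not recommendations.
def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# chunk(policy_text) would yield overlapping 500-character pieces, each
# of which gets its own embedding in the index (policy_text is hypothetical).
```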

The bigger picture is that RAG quietly redrew the line between what a language model knows and what it can answer. Models will keep getting better, but the more interesting frontier for most real-world uses is no longer the model itself. It is what you give it to read in the moment.