How Retrieval Augmented Generation (RAG) Makes AI Smarter and More Up-to-Date
"Popular LLMs effectively use only 10–20% of the context, and their performance declines sharply with increased reasoning complexity." (Kuratov et al.)
Large language models are amazing at processing information, but they have limits. Loading tons of text directly into a prompt is costly, slow, and often ineffective. More data doesn’t always mean better answers, and this approach often creates more issues than it solves.
Take Google’s Gemini 1.5 Pro model, for example. It can process up to 1 million tokens in a single prompt. That’s about nine Harry Potter books! But if you’re trying to get insights about a specific chapter, pouring all those books into the prompt is overkill. Instead, you’d want a way to pull up only what’s relevant to your question. That’s exactly what Retrieval Augmented Generation (RAG) does.
What is Retrieval Augmented Generation?
RAG is a simple but powerful solution that helps AI models find the right information when they need it. Think of it like a librarian who doesn’t memorize every book but knows where to look for the right answers. RAG makes language models faster, more accurate, and always up-to-date without needing to retrain them constantly.
How Retrieval Augmented Generation Works
Here’s a quick look at how RAG pulls up only the useful info:
- Chunking: First, RAG breaks down documents into small, manageable “chunks.” These can be sentences or paragraphs, depending on what’s useful.
- Embedding Chunks: Next, each chunk is converted into a numerical vector (an embedding) using a model like OpenAI’s Ada or one of Snowflake’s open-source models. Chunks with similar meanings get similar vectors, which makes it easy to find related ideas later.
- Storing in a Vector Database: These vectors are stored in a searchable database, such as Pinecone, Qdrant, pgvector for Postgres, Chroma, or Weaviate, that can quickly pull up related info when needed. (A minimal code sketch of these three indexing steps follows this list.)
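To make these three steps concrete, here’s a minimal sketch in Python. It’s not a production setup: the open-source sentence-transformers model "all-MiniLM-L6-v2" stands in for whichever embedding model you actually use, a plain NumPy array stands in for a real vector database, and "handbook.txt" is a hypothetical document.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (a simple, common strategy)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 1. Chunking
document = open("handbook.txt").read()          # hypothetical long document
chunks = chunk_text(document)

# 2. Embedding each chunk (all-MiniLM-L6-v2 produces 384-dimensional vectors)
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(chunks, normalize_embeddings=True)

# 3. "Storing": an in-memory matrix with one row per chunk, standing in
#    for a real vector database like Pinecone or Qdrant
index = np.asarray(vectors)
```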
Then, when a user asks a question:
- Query Embedding: The query itself gets embedded as well, so it “speaks the same language” as the stored data.
- Vector Similarity Search: The model checks the query against stored vectors and finds the chunks that match best. For example, if the query is about “tires,” the search might bring up relevant information about “cars” instead of unrelated stuff like “birds.”
- Response Generation with Relevant Context: Finally, the chosen chunks go to the language model along with the query, so it has the relevant info right there, resulting in a smarter, more focused answer. (The query side is sketched in code below.)
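Continuing the sketch above (same `model`, `index`, and `chunks`), the query side could look like this. The `call_llm` function is a placeholder for whatever generation API you use:

```python
def retrieve(query: str, top_k: int = 3) -> list[str]:
    """Embed the query and return the top_k most similar chunks."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ q                 # cosine similarity, since vectors are normalized
    best = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in best]

def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)  # placeholder: swap in your LLM client of choice
```

Everything here is tunable: the number of chunks, the similarity metric, and the prompt template all affect answer quality, which brings us to the settings covered below.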
Why Businesses Need RAG
RAG isn’t just clever — it’s also practical for businesses. Here’s why:
- Keeps Info Up-to-Date: RAG can access data that’s refreshed as often as needed, so there’s no waiting around for a model retrain when things change.
- Efficient Memory: By grabbing only relevant pieces, RAG uses less computing power and answers faster.
- Clear and Reliable Outputs: Since RAG pulls in specific chunks, you can trace an answer back to the sources it came from, which cuts down on hallucinations.
Fine-Tuning RAG for Your Needs
To get the most out of RAG, there are a few settings to adjust:
- Chunk Size: Choosing the right size is key — too small, and the context breaks up; too big, and you might lose focus.
- Vector Database Choice: Picking the right database impacts speed, accuracy, and cost.
- How Many Chunks to Pass: Too many can dilute the prompt and drive up cost; too few might miss important context.
- Vector Size: The size of the vectors (or embeddings). Larger vectors capture more nuance and detail but need more storage and processing power. Smaller vectors are more lightweight but may miss some finer points in complex queries.
- Embedding Metadata: Adding metadata (e.g. document type, creation date, author, etc.) can improve searches even further.
- Adding Keyword Search: Combining vector searches with keyword searches can help narrow down results more effectively, especially when you are looking for specific terms (see the hybrid-search sketch after this list).
- And many other techniques.
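As one concrete example of these tweaks, here’s a rough sketch of hybrid search using reciprocal rank fusion (RRF), a common way to merge a keyword ranking with a vector ranking. The input rankings are assumed to come from whatever search backends you already have, e.g. BM25 for keywords:

```python
def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Merge several ranked lists of chunk ids, rewarding items that
    rank highly in any list. k=60 is the conventional RRF constant."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical inputs: chunk ids ranked by keyword search and by vector similarity
keyword_rank = [4, 2, 9, 1]
vector_rank = [2, 4, 7, 9]
merged = reciprocal_rank_fusion([keyword_rank, vector_rank])  # [4, 2, 9, 7, 1]
```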
RAG is also flexible enough to work with more than just text. It can handle structured data from SQL tables, semi-structured data like MongoDB documents, and many other data types.
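For instance, rows from a SQL table can be serialized into short text snippets and embedded just like document chunks. A rough sketch, assuming a hypothetical products table in SQLite and reusing the embedding model from the indexing sketch above:

```python
import sqlite3

conn = sqlite3.connect("shop.db")  # hypothetical database
rows = conn.execute("SELECT name, price, description FROM products").fetchall()

# Turn each row into a small text "chunk" that can be embedded and indexed
# exactly like the document chunks above
row_chunks = [
    f"Product: {name}. Price: ${price}. {description}"
    for name, price, description in rows
]
row_vectors = model.encode(row_chunks, normalize_embeddings=True)
```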
Wrapping Up
RAG is an essential tool for getting the most out of LLMs. By blending language capabilities with quick, current data from a vector database, RAG gives businesses sharper, more dependable answers while avoiding the expense of frequent retraining. It’s a smart solution to keep AI accurate, efficient, and always up-to-date.