Blog Post #22: Retrieval Augmented Generation (RAG): Giving Your Agent External Knowledge

A Large Language Model, for all its power, is a closed book. Its knowledge is vast but frozen in time, limited to the data it was trained on. It doesn’t know about yesterday’s news, your company’s latest quarterly report, or the specifics of your private project files.

So, how do we build agents that can reason about information they were never trained on? We could try fine-tuning, but that process is slow, computationally expensive, and impractical for knowledge that changes daily.

The solution is a powerful and elegant pattern that has become the cornerstone of modern AI applications: Retrieval-Augmented Generation (RAG).

The Core Idea: An Open-Book Exam for LLMs

At its heart, RAG gives an LLM an “open-book exam.” Instead of forcing the model to memorize every fact (which is impossible), we give it a library of relevant information (our “textbook”) and the ability to look up the right facts just before it has to answer a question.

RAG connects your agent to external, up-to-date, or proprietary knowledge sources at the moment of the query, ensuring its responses are relevant, accurate, and grounded in fact.

The RAG Workflow: A Step-by-Step Breakdown

The RAG process is best understood as two distinct phases: the Indexing Pipeline (preparing the knowledge) and the Runtime Pipeline (answering a question).

=========================
| PHASE 1: INDEXING     | (Done once, or periodically to update knowledge)
=========================

+-------------------+      +-------------------+      +-------------------+
|  Your Documents   |----->|      Loader       |----->|      Splitter     |
| (PDFs, Web Pages) |      |  (Reads the data) |      |   (Chunks text)   |
+-------------------+      +-------------------+      +-------------------+
                                                               |
                                                               v
+-------------------+      +-------------------+      +-------------------+
| Vector Database   |<-----|     Embedder      |<-----|      Chunks       |
| (Stores vectors)  |      | (Creates vectors) |      | (Pieces of text)  |
+-------------------+      +-------------------+      +-------------------+


=========================
| PHASE 2: RUNTIME      | (Happens every time a user asks a question)
=========================

+-------------------+      +-------------------+      +-------------------+
| User's Question   |----->|     Embedder      |----->|    Retriever      |
| ("What is RAG?")  |      | (Vectorizes query)|      | (Finds similar    |
|                   |      |                   |      | chunks in DB)     |
+-------------------+      +-------------------+      +-------------------+
                                                               |
                                                               v
+-------------------+      +-------------------+      +-------------------+
|   Final Prompt    |<-----|   User's Question |<-----|  Retrieved Chunks |
| (Question+Context)|      |                   |      | (Relevant context)|
+-------------------+      +-------------------+      +-------------------+
      |
      v
+-------------------+
|        LLM        |-----> Final, context-aware answer
+-------------------+

Phase 1: The Indexing Pipeline (Preparing the “Textbook”)

This is the offline process where we prepare our knowledge base.

  • Loading: You start with your documents (PDFs, Word docs, website content, Notion pages, etc.). A Loader component reads this data into a standardized text format.
  • Splitting (or “Chunking”): A full document is usually too large to hand to the model wholesale, and retrieval works best on small, focused passages. A Splitter breaks the text down into smaller, meaningful chunks (e.g., paragraphs or sentences). This is one of the most critical steps for high-quality RAG.
  • Embedding: Each text chunk is fed into an Embedding Model. This model converts the text into a numerical vector—a list of numbers that represents the semantic meaning of the text. Chunks with similar meanings will have similar vectors.
  • Storing: These vectors, along with the original text chunks they represent, are loaded into a specialized Vector Database (like Chroma, Pinecone, or FAISS). This database is optimized for incredibly fast similarity searches.
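
To make the four steps above concrete, here is a minimal sketch of an indexing pipeline in Python. It assumes the chromadb package is installed and relies on Chroma's built-in default embedding model; the file name, chunk size, and naive splitter are purely illustrative.

# Minimal indexing sketch: load, chunk, embed, and store.
import chromadb

def load_document(path: str) -> str:
    # Loader: read a plain-text file (real loaders also handle PDFs, HTML, etc.)
    with open(path, encoding="utf-8") as f:
        return f.read()

def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # Splitter: naive fixed-size chunking with a small overlap between chunks
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

client = chromadb.Client()                    # in-memory vector database
collection = client.create_collection("knowledge_base")

text = load_document("company_report.txt")    # hypothetical document
chunks = split_text(text)

# Embedder + storage: Chroma embeds each chunk with its default embedding
# model and stores the vector alongside the original text chunk.
collection.add(
    documents=chunks,
    ids=[f"chunk-{i}" for i in range(len(chunks))],
)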

Phase 2: The Runtime Pipeline (Answering the Question)

This happens in real-time when your agent receives a user query.

  • User Query & Embedding: The user asks a question, like “What are the benefits of RAG?” This question is passed through the same embedding model to create a query vector.
  • Retrieval: The system takes this query vector and performs a similarity search in the vector database. It finds the text chunks whose vectors are mathematically “closest” to the query vector. These chunks are the most relevant pieces of information in your entire knowledge base for answering that specific question.
  • Augmentation: The system constructs a new prompt. This prompt includes the original user question and injects the retrieved text chunks as “context.”

Example Prompt:

You are a helpful AI assistant. Use the following context to answer the user's question. If you don't know the answer from the context, say you don't know.
Context:
[Chunk 34]: RAG reduces the likelihood of factual inaccuracies, or 'hallucinations', by grounding the model in provided text.
[Chunk 87]: Because RAG allows for citing sources, it increases user trust and allows for fact-checking.
User Question: What are the benefits of RAG?
  • Generation: This augmented prompt is sent to the LLM. The LLM now has the precise, external knowledge it needs to generate a high-quality, factually grounded answer.
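
Continuing the indexing sketch above, the runtime side can be sketched just as briefly. It reuses the collection created earlier; the call_llm helper at the end is a hypothetical stand-in for whichever LLM client or SDK you actually use.

# Minimal runtime sketch: embed the query, retrieve similar chunks,
# build the augmented prompt, and generate an answer.
question = "What are the benefits of RAG?"

# Retrieval: Chroma embeds the query with the same embedding model used at
# indexing time and returns the most similar stored chunks.
results = collection.query(query_texts=[question], n_results=3)
retrieved_chunks = results["documents"][0]

# Augmentation: inject the retrieved chunks as context for the model.
context = "\n".join(f"[Chunk {i}]: {chunk}" for i, chunk in enumerate(retrieved_chunks))
prompt = (
    "You are a helpful AI assistant. Use the following context to answer "
    "the user's question. If you don't know the answer from the context, "
    "say you don't know.\n\n"
    f"Context:\n{context}\n\n"
    f"User Question: {question}"
)

# Generation: send the augmented prompt to the LLM of your choice.
answer = call_llm(prompt)   # hypothetical helper wrapping your LLM API
print(answer)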

Why RAG is a Game-Changer for Agents

  • Reduces Hallucinations: The primary reason for RAG’s popularity. By grounding the LLM in specific, provided text, it dramatically reduces the chance that the model will invent or “hallucinate” incorrect facts.
  • Enables Up-to-Date & Proprietary Knowledge: This is how you make an agent an expert on your data. Your agent can now answer questions about your company’s internal wiki, your legal team’s contracts, or news articles published just five minutes ago.
  • Cost-Effective and Fast: Compared to fine-tuning, which is slow and expensive, updating a RAG system is as simple as adding a new document to your vector database. It’s an efficient way to expand your agent’s knowledge.
  • Provides Verifiability and Trust: Because you know exactly which chunks were retrieved to generate an answer, you can build systems that cite their sources. This allows users to verify the information for themselves, which is critical for building trust in your application.
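
Surfacing those sources requires very little extra code. Here is a minimal sketch, reusing the collection from the earlier sketches: every query result comes back with the IDs of the chunks it matched, which the agent can list next to its answer.

results = collection.query(query_texts=["What are the benefits of RAG?"], n_results=3)
for chunk_id, text in zip(results["ids"][0], results["documents"][0]):
    # Show which stored chunks the answer was grounded in.
    print(f"[{chunk_id}] {text[:80]}...")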

Conclusion

RAG transforms the LLM from a closed-book, and sometimes forgetful, genius into an open-book, diligent researcher. It is the most common and powerful pattern for giving your agents the specific, accurate, and up-to-date knowledge they need to solve real-world problems. For nearly any agent that needs to be an “expert” in a specific domain, RAG is the foundational architecture you’ll build upon.

Author

Debjeet Bhowmik

Experienced Cloud & DevOps Engineer with hands-on experience in AWS, GCP, Terraform, Ansible, ELK, Docker, Git, GitLab, Python, PowerShell, Shell, and theoretical knowledge of Azure, Kubernetes & Jenkins. In my free time, I write blogs on ckdbtech.com.
