Blog Post #42: From Raw Files to Queryable Knowledge: Building Your First Data Index

In our last post, we saw LlamaIndex create a queryable engine from a text file in just a few lines. It felt almost magical. But what was really happening under the hood? How do you go from a messy folder of raw documents—PDFs, web pages, text files—to a structured, intelligent knowledge base that an AI can use?

This post is a practical, hands-on guide to the first and most critical phase of any Retrieval-Augmented Generation (RAG) application: Indexing. We will take a collection of local documents and build a persistent, queryable vector index using LlamaIndex’s powerful and intuitive data ingestion pipeline.

This process is how you create a custom “long-term memory” or a specialized “textbook” for your AI agent.


Part 1: Preparing Your Knowledge Base

First, let’s gather the documents that will form our agent’s knowledge.

Step 1: Create Your data Folder

In your project directory, create a new folder named data. This is where you’ll place all your source files.

Step 2: Add Your Documents

LlamaIndex can handle a huge variety of file types. For this tutorial, we’ll create two simple text files. The same process would work if one of these were a PDF or a Word document.

Create data/sodepur_history.txt:

The history of Sodepur is ancient, closely linked with the Sena dynasty of Bengal. The region flourished as a significant hub during the medieval period under the Bengal Sultanate and was well-known for its vibrant riverside trade along the Hooghly river. It played a role in the Bengal Renaissance.

Create data/sodepur_culture.txt:

Sodepur is famous for its rich religious and cultural heritage, particularly its connection to the Vaishnava tradition. The Shyamsundar Temple is a central landmark. Furthermore, the annual Ghoshpara fair, a major event for the Kartabhaja sect, attracts thousands of pilgrims from across the state, celebrating a unique syncretic faith.

Step 3: Install Necessary Libraries

Make sure your virtual environment is activated and install the libraries we’ll need.

pip install llama-index llama-index-llms-openai python-dotenv pypdf

(Note: pypdf is a common dependency that SimpleDirectoryReader uses automatically for PDF files, and python-dotenv lets our scripts load the OPENAI_API_KEY from a .env file.)


Part 2: The LlamaIndex Ingestion Pipeline

Now we’ll write a script (let’s call it build_index.py) to process these files.

Step 1: Loading with SimpleDirectoryReader

This is LlamaIndex’s “magic” loader. You point it at a folder, and it automatically detects the file types and uses the correct loader for each one (.txt, .pdf, .docx, .md, etc.), loading them into memory.

# build_index.py
from dotenv import load_dotenv
from llama_index.core import SimpleDirectoryReader

# Load OPENAI_API_KEY from a .env file, if present (needed later for embeddings)
load_dotenv()

print("Loading documents...")
# This will load all files in the './data' directory
documents = SimpleDirectoryReader("./data").load_data()
print(f"Loaded {len(documents)} document(s).")

Step 2: Parsing into Nodes (SentenceSplitter)

Once loaded, the Documents must be broken down into smaller chunks, or Nodes. This is handled by a Node Parser. A great default choice is the SentenceSplitter, which intelligently splits text while trying to respect sentence boundaries.

# build_index.py (continued)
from llama_index.core.node_parser import SentenceSplitter

# Create a node parser with a specific chunk size (chunk_size and chunk_overlap are measured in tokens)
node_parser = SentenceSplitter(chunk_size=200, chunk_overlap=20)

# Get nodes from the documents
nodes = node_parser.get_nodes_from_documents(documents)
print(f"Split documents into {len(nodes)} nodes.")

Step 3: Embedding and Indexing (VectorStoreIndex)

This is the final step of the initial build. We take our list of Nodes, and the VectorStoreIndex handles creating a numerical embedding for each one and storing them in an efficient, searchable structure. By default, LlamaIndex uses OpenAI’s embedding model for this, so your OPENAI_API_KEY must be available in the environment (which is why we call load_dotenv() at the top of the script).

# build_index.py (continued)
from llama_index.core import VectorStoreIndex

print("Creating index...")
# This will use the default OpenAI embedding model
index = VectorStoreIndex(nodes)
print("Index created successfully.")

Part 3: Persisting Your Index (Don’t Rebuild Every Time!)

The index we just built is in-memory. If the script ends, it’s gone. Re-embedding documents on every run is slow and can be expensive if you’re using a paid embedding API. The professional workflow is to build the index once and save it to disk.

LlamaIndex makes this incredibly easy.

# build_index.py (continued)

# Define a directory to store the index
PERSIST_DIR = "./storage"

# Save the index to disk
index.storage_context.persist(persist_dir=PERSIST_DIR)
print(f"Index saved to {PERSIST_DIR}")

Now, in your main application, you can simply load this pre-built index instead of rebuilding it every time.


Part 4: The Full Workflow: Querying Your Knowledge

Let’s create our main application file, main.py, which will use our persistent index. This script will check if an index exists. If it does, it loads it; if not, it builds and saves it.

# main.py
import os
from dotenv import load_dotenv
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)

load_dotenv()
PERSIST_DIR = "./storage"

if not os.path.exists(PERSIST_DIR):
    # Load the documents and create the index
    documents = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    # Store it for next time
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # Load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

# Now we can create a query engine and ask questions
query_engine = index.as_query_engine()

print("--- Querying cultural information ---")
response1 = query_engine.query("What is the religious heritage of Sodepur?")
print(response1)

print("\n--- Querying historical information ---")
response2 = query_engine.query("Tell me about Sodepur's history during the medieval period.")
print(response2)

When you run python main.py for the first time, it will build the index. On every subsequent run, it will load it instantly from the ./storage directory.
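One subtle difference: main.py builds with VectorStoreIndex.from_documents(), which applies LlamaIndex’s default chunking rather than the 200-token SentenceSplitter we configured in build_index.py. If you want both scripts to chunk identically, you can pass your parser into from_documents() explicitly (a sketch, assuming the transformations parameter is available in your installed LlamaIndex version):

# main.py (optional tweak for consistent chunking)
from llama_index.core.node_parser import SentenceSplitter

index = VectorStoreIndex.from_documents(
    documents,
    transformations=[SentenceSplitter(chunk_size=200, chunk_overlap=20)],
)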

Notice how the AI can now answer specific questions by finding and combining information from both of our source files, acting as a true expert on the provided material.
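You don’t have to take that on faith: every response object carries the retrieved chunks it was grounded in. A quick sketch to see which files (and with what similarity scores) contributed to an answer:

# main.py (optional: inspect the retrieved sources)
for source in response1.source_nodes:
    # Each retrieved chunk records its origin file and a similarity score
    print(source.node.metadata.get("file_name", "?"), "| score:", source.score)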

Conclusion

You have now mastered the most fundamental process in any RAG application: the ingestion and indexing pipeline. You’ve transformed a chaotic folder of raw files into a structured, persistent, and queryable knowledge base.

This indexed data is the “long-term memory” for your AI. LlamaIndex offers dozens of other loaders for services like Notion and Slack, as well as more advanced indexing strategies, but the core pattern you’ve learned today—Load -> Parse -> Index -> Persist—is the foundation for them all. You now have the power to give your agent a custom library on any topic you choose.

Author

Debjeet Bhowmik

Experienced Cloud & DevOps Engineer with hands-on experience in AWS, GCP, Terraform, Ansible, ELK, Docker, Git, GitLab, Python, PowerShell, Shell, and theoretical knowledge of Azure, Kubernetes & Jenkins. In my free time, I write blogs on ckdbtech.com.
