The Complete Guide to RAG Development Methodology

What Is RAG?

RAG (Retrieval-Augmented Generation) is a technique where an LLM retrieves external knowledge to generate responses. Think of it as an open-book exam. Instead of memorizing everything, the LLM looks up the information it needs from reference materials to answer questions.

LLMs don’t know information beyond their training data cutoff, and they can generate factually incorrect answers through hallucination. RAG mitigates both problems by retrieving relevant external documents at query time.

| Approach | Pros | Cons |
| --- | --- | --- |
| Pure LLM | Fast responses, no extra infrastructure | Hallucination, lacks up-to-date information |
| Fine-tuning | Domain-specific performance | High cost, difficult to update data |
| RAG | Reflects latest information, can cite sources | Depends on retrieval quality, increased latency |

The 3 Stages of RAG Architecture

A RAG pipeline consists of three stages: Indexing, Retrieval, and Generation.

Stage 1 — Indexing: Split documents into small chunks, convert them into vectors using an embedding model, and store them in a vector database.

Stage 2 — Retrieval: Vectorize the user’s question using the same embedding model and search the vector DB for chunks with the highest similarity.

Stage 3 — Generation: Include the retrieved chunks as context in the LLM prompt to generate the final answer.
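The three stages can be sketched end to end in plain Python. This is a toy illustration only: the documents are made up, and word-overlap (Jaccard) scoring stands in for a real embedding model and vector database.

```python
# Toy sketch of the three RAG stages. Word overlap stands in for embeddings.
docs = [
    "RAG retrieves external documents to ground LLM answers.",
    "Fine-tuning bakes domain knowledge into model weights.",
    "Vector databases store embeddings for similarity search.",
]

# Stage 1 - Indexing: represent each chunk as a set of lowercase words
index = [set(d.lower().split()) for d in docs]

# Stage 2 - Retrieval: score chunks by word overlap with the question (Jaccard)
def retrieve(question, k=1):
    q = set(question.lower().split())
    scores = [len(q & chunk) / len(q | chunk) for chunk in index]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:k]]

# Stage 3 - Generation: in a real pipeline this prompt would go to an LLM
context = retrieve("How does RAG ground LLM answers?")[0]
prompt = f"Context: {context}\nQuestion: How does RAG ground LLM answers?"
print(context)
```

The rest of this guide replaces each stand-in with a production component: an embedding model for indexing, a vector DB for retrieval, and an LLM for generation.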

Core Components

Embedding Models

Embedding is the process of converting text into high-dimensional vectors. Texts with similar meanings are placed close together in vector space.
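"Close together" is usually measured with cosine similarity. The sketch below uses hand-made 3-dimensional vectors as stand-ins for real embeddings (which have hundreds or thousands of dimensions); only the similarity math is real.

```python
import numpy as np

# Cosine similarity: the standard way to compare embedding vectors.
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made stand-ins for real embeddings (illustrative values only)
cat  = np.array([0.9, 0.8, 0.1])   # "cat"
dog  = np.array([0.8, 0.9, 0.2])   # "dog"  - semantically close to "cat"
bond = np.array([0.1, 0.2, 0.9])   # "bond" - unrelated topic

print(cosine_similarity(cat, dog))   # high: similar meaning
print(cosine_similarity(cat, bond))  # low: different meaning
```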

| Model | Dimensions | Features |
| --- | --- | --- |
| OpenAI text-embedding-3-small | 1536 | High performance, requires API call |
| sentence-transformers/all-MiniLM-L6-v2 | 384 | Open source, can run locally |
| Cohere embed-v3 | 1024 | Excellent multilingual support |

Vector Databases

A vector DB is a specialized database that stores embedding vectors and performs similarity searches. Popular options include ChromaDB (local/lightweight), Pinecone (cloud/managed), and FAISS (Meta open source, optimized for large-scale search).

Chunking Strategies

How you split documents significantly impacts RAG performance.

| Strategy | Description | Best For |
| --- | --- | --- |
| Fixed size | Split by fixed token/character count | Uniform documents |
| Recursive splitting | Split by paragraph, then sentence, then word | General text |
| Semantic splitting | Split by semantic units | Structured documents |

A chunk size of 500–1,000 tokens with 10–20% overlap is generally recommended.
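The simplest strategy, fixed-size splitting with overlap, fits in a few lines. This sketch counts characters for simplicity; production splitters count tokens instead.

```python
# Minimal fixed-size chunker with overlap (character-based for simplicity).
def chunk_text(text, chunk_size=500, overlap=50):
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap          # each new chunk starts `step` chars later
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "x" * 1200
chunks = chunk_text(doc, chunk_size=500, overlap=50)
print([len(c) for c in chunks])  # [500, 500, 300]
```

The overlap means the last 50 characters of each chunk reappear at the start of the next, so a sentence cut at a boundary still shows up whole in at least one chunk.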

Implementing Basic RAG with Python

Let’s implement a basic RAG pipeline using LangChain and ChromaDB. First, install the required packages.

# Install required packages
pip install langchain langchain-openai langchain-community chromadb

Step 1: Document Loading and Chunking

Load a text document and split it into appropriately sized chunks.

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load document
loader = TextLoader("docs/guide.txt", encoding="utf-8")
documents = loader.load()

# Configure recursive text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # Max 500 characters per chunk
    chunk_overlap=50,      # 50-character overlap between chunks
    separators=["\n\n", "\n", " ", ""]  # Split priority order
)

# Split document into chunks
chunks = text_splitter.split_documents(documents)
print(f"Total {len(chunks)} chunks created")  # Total 12 chunks created

RecursiveCharacterTextSplitter first tries to split by paragraphs (\n\n), then by line breaks (\n), then by spaces ( ) if chunks are still too large. This creates natural chunks that preserve context.
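The recursive idea can be sketched without LangChain. This simplified version only splits (the real splitter also merges adjacent small pieces back up toward chunk_size, which is omitted here), but it shows the separator-priority recursion described above.

```python
# Simplified sketch of recursive splitting: try the coarsest separator first,
# recurse with finer separators on any piece that is still too large.
def recursive_split(text, max_len=50, separators=("\n\n", "\n", " ")):
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_len:
            chunks.append(piece)                              # small enough: keep
        else:
            chunks.extend(recursive_split(piece, max_len, rest))  # recurse finer
    return chunks

text = "First paragraph here.\n\nA much longer second paragraph that will not fit in one chunk."
for c in recursive_split(text):
    print(repr(c))
```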

Step 2: Generating Embeddings and Storing Vectors

Convert the split chunks into vectors and store them in ChromaDB.

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Initialize embedding model (requires OPENAI_API_KEY env variable)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Store vectors in ChromaDB
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"  # Persist to local disk
)

# Create retriever (returns top 3 results)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
print("Vector indexing complete")  # Vector indexing complete

By specifying persist_directory, vector data is saved to disk, eliminating the need to re-index when restarting the program.

Step 3: Retrieval + Generation (RAG Chain)

Use retrieved documents as context to have the LLM generate an answer.

from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

# Initialize LLM
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# RAG prompt template
prompt = ChatPromptTemplate.from_template("""
Answer the question based on the following context.
If the information is not in the context, respond with "I could not find that information."

Context: {context}
Question: {question}
""")

# Build RAG chain
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Ask a question
answer = rag_chain.invoke("What are the main advantages of RAG?")
print(answer)

RunnablePassthrough() passes the user input as-is, while the retriever searches for relevant documents using the same input. Both results are combined in the prompt template and sent to the LLM.
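The dict step at the start of the chain is a parallel fan-out: every value in the dict receives the same input, and the results are collected under their keys. A miniature version in plain Python, with hypothetical stand-ins for the retriever and LLM:

```python
# Miniature version of the dict-parallel pattern the chain uses:
# run several callables on the same input and collect results by key.
def run_parallel(steps, user_input):
    return {name: fn(user_input) for name, fn in steps.items()}

# Stand-ins for the real components (hypothetical, for illustration only)
fake_retriever = lambda q: ["RAG cites sources.", "RAG reduces hallucination."]
passthrough = lambda q: q          # what RunnablePassthrough() does

inputs = run_parallel(
    {"context": fake_retriever, "question": passthrough},
    "What are the main advantages of RAG?",
)
print(inputs["question"])
```

The resulting dict is exactly the shape the prompt template expects: one key per placeholder ({context} and {question}).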

RAG Performance Optimization Tips

Adjust chunk size: Chunks that are too small lack context, while chunks that are too large introduce noise. Experiment to find the optimal size for your domain.

Hybrid search: When vector similarity search alone isn’t enough, combining it with keyword search like BM25 improves retrieval accuracy.

Reranking: Re-sorting initial search results with a Cross-Encoder model places more relevant documents at the top.

Metadata filtering: Assign metadata such as dates and categories to documents, and filter during search to reduce irrelevant results.
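For hybrid search, one common way to merge a vector ranking with a keyword ranking is Reciprocal Rank Fusion (RRF), which combines rank positions rather than raw scores (which live on different scales). The document IDs below are hypothetical:

```python
# Reciprocal Rank Fusion: merge rankings by summing 1 / (k + rank) per document.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_ranking  = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search order
keyword_ranking = ["doc_b", "doc_d", "doc_a"]   # hypothetical BM25 order

fused = rrf([vector_ranking, keyword_ranking])
print(fused)  # doc_b ranks first: it scores well in both lists
```

The constant k (60 is the value from the original RRF paper) damps the influence of any single ranking, so a document that appears high in both lists beats one that tops only one of them.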

Summary

RAG is a practical technique that compensates for LLM limitations through external knowledge retrieval. The key factors are a good chunking strategy, an appropriate embedding model, and efficient vector search. With LangChain and ChromaDB, you can build a basic RAG pipeline in just a few dozen lines of code — try loading your own documents and experimenting.
