RAG Fundamentals: How Retrieval-Augmented Generation Works

Think of an LLM as a brilliant expert who graduated years ago and never read anything since. They can reason, explain, and synthesize — but only from what they learned in school. Retrieval-Augmented Generation is the equivalent of handing that expert a stack of relevant documents before they answer your question.

This post breaks down how RAG works from first principles, then walks you through building a working RAG chatbot in both Python and Node.js. If you want the big-picture view first, start with The Complete Guide to RAG.

What Is RAG?

Retrieval-Augmented Generation (RAG) is a pattern where an LLM's input is augmented with relevant information retrieved from an external knowledge base at inference time. Instead of relying on knowledge baked into model weights during training, the model gets fresh, specific context with every query.

The original 2020 paper by Lewis et al. defines RAG as a model that combines a pre-trained parametric memory (the LLM) with a non-parametric memory (a retrieval index) to produce more accurate, grounded outputs.

In practice, this means: search first, then generate.

The Retriever-Generator Pattern

Every RAG system has two core components:

The Retriever

The retriever's job is to find the most relevant pieces of information for a given query. It searches through a pre-built index of documents (usually stored as vector embeddings) and returns the top-K most similar results.

The retriever doesn't understand the question or generate answers — it just finds relevant text. Think of it as a highly specialized search engine optimized for semantic similarity rather than keyword matching.

The Generator

The generator is the LLM. It receives the original user query plus the retrieved documents as context, and produces a natural language response. The key difference from standard LLM completion: the generator is grounded in specific retrieved evidence, not just its training data.

The Pipeline

User Question
      │
      ▼
┌─────────────┐
│  Retriever   │──── searches ──── Vector Store
└─────────────┘                    (your documents)
      │
      │ top-K relevant chunks
      ▼
┌─────────────┐
│  Generator   │──── LLM (GPT-4, Claude, Llama, etc.)
│  (Augmented  │
│   Prompt)    │
└─────────────┘
      │
      ▼
  Grounded Response

How Embeddings Work

Embeddings are the foundation of semantic retrieval. They transform text into dense numerical vectors — arrays of floating-point numbers (typically 384 to 3072 dimensions) that capture the semantic meaning of the input.

The Key Insight

Texts with similar meanings produce vectors that are close together in vector space. "How to train a neural network" and "Steps for building a deep learning model" will have high cosine similarity, even though they share almost no words. Conversely, "Java programming language" and "Java island in Indonesia" will typically score as much less similar, despite sharing the keyword "Java".

This is what separates embedding-based retrieval from keyword search. The retriever matches on meaning rather than exact terms, so a query can surface a relevant passage even when the two share no words. Retrieving on intent and concepts makes interactions with a knowledge base feel more natural.
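To make "close together in vector space" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are hand-made for illustration and are not real embeddings; a real model would return hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (assumption: NOT produced by an embedding model)
train_nn = [0.9, 0.1, 0.0]     # "How to train a neural network"
build_dl = [0.8, 0.2, 0.1]     # "Steps for building a deep learning model"
java_island = [0.0, 0.2, 0.9]  # "Java island in Indonesia"

sim_related = cosine_similarity(train_nn, build_dl)
sim_unrelated = cosine_similarity(train_nn, java_island)
print(f"related:   {sim_related:.3f}")
print(f"unrelated: {sim_unrelated:.3f}")
```

The related pair scores close to 1 and the unrelated pair close to 0, which is the ranking signal a vector store uses when it returns the top-K chunks.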

How Embedding Models Work

Embedding models (like OpenAI's text-embedding-3-small or the open-source all-MiniLM-L6-v2) are transformer-based models trained on massive datasets of text pairs. They learn to map semantically similar inputs to nearby points in vector space.

from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    input="What is retrieval-augmented generation?",
    model="text-embedding-3-small"
)

vector = response.data[0].embedding
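print(f"Dimensions: {len(vector)}")  # 1536 (default size for text-embedding-3-small)
print(f"First 5 values: {vector[:5]}")
# Prints five floats. The values below are illustrative; yours will differ.
# [-0.0123, 0.0456, -0.0789, 0.0234, -0.0567]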

The RAG Workflow: End to End

A complete RAG system has two phases: an offline indexing phase and an online query phase.

Phase 1: Document Ingestion (Offline)

Before you can retrieve anything, you need to build your knowledge base.

Step 1: Load Documents. Collect your source material: PDFs, web pages, Markdown files, database exports, API responses.

Step 2: Chunk Documents. Split documents into smaller passages. LLMs have context window limits, and smaller chunks improve retrieval precision. Common strategies include fixed-size chunks (e.g., 512 tokens with 50-token overlap) and recursive character splitting.

Step 3: Generate Embeddings. Pass each chunk through an embedding model to produce a vector representation.

Step 4: Store in Vector Database. Save the vectors (plus the original text and metadata) in a vector store like Chroma, FAISS, Pinecone, or Weaviate.
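The fixed-size chunking strategy from Step 2 can be sketched in a few lines. This version approximates tokens with whitespace-separated words, which is an assumption made to keep the example dependency-free; a real pipeline would count tokens with the model's tokenizer. The function name chunk_words is hypothetical.

```python
def chunk_words(text, chunk_size=512, overlap=50):
    """Split text into chunks of at most chunk_size words.

    Each chunk starts (chunk_size - overlap) words after the previous one,
    so consecutive chunks share `overlap` words.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reached the end of the text
    return chunks

# A synthetic 1200-word document: w0 w1 ... w1199
sample = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_words(sample, chunk_size=512, overlap=50)
print(len(chunks))
print(chunks[1].split()[0])  # first word of the second chunk
```

With these settings the second chunk begins 462 words in, so the last 50 words of one chunk are repeated as the first 50 words of the next.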

Phase 2: Query and Generate (Online)

Step 1: Embed the Query. Convert the user's question into a vector using the same embedding model.

Step 2: Retrieve Similar Chunks. Search the vector store for the K most similar document chunks (typically K = 3–10).

Step 3: Augment the Prompt. Construct a prompt that includes the user's question and the retrieved chunks as context.

Step 4: Generate Response. Send the augmented prompt to the LLM, which generates a response grounded in the retrieved evidence.
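The first three query-phase steps can be sketched without any external services. In the example below, toy_embed is a hypothetical stand-in that builds a bag-of-words count vector so the code runs without an API key. It matches on shared words rather than meaning, so treat it as a placeholder for a real embedding model. The list named index plays the role of the vector store.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, index, k=3):
    """Steps 1 and 2: embed the query, then return the k most similar chunk texts."""
    query_vec = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, chunks):
    """Step 3: place the retrieved chunks and the question into one prompt."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question based only on the following context.\n"
        "If you cannot answer from the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\n"
    )

documents = [
    "RAG retrieves relevant chunks before generating an answer.",
    "Chroma and FAISS are vector stores.",
    "Bananas are rich in potassium.",
]
index = [(text, toy_embed(text)) for text in documents]  # built once, offline

question = "Which vector stores can I use?"
top_chunks = retrieve(question, index, k=2)
prompt = build_prompt(question, top_chunks)
print(prompt)  # Step 4 would send this string to the LLM
```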

For a detailed comparison of vector databases and retrieval strategies, see RAG Architecture Deep Dive.

Code Example: RAG Chatbot in Python

Here's a complete RAG pipeline using LangChain and Chroma:

import os
from langchain_community.document_loaders import DirectoryLoader, TextLoader
# Text splitters live in the langchain-text-splitters package in current LangChain releases
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

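# Read the API key from the environment rather than hardcoding it in source.
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable first")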

# 1. Load documents
loader = DirectoryLoader("./docs", glob="**/*.txt", loader_cls=TextLoader)
documents = loader.load()
print(f"Loaded {len(documents)} documents")

# 2. Split into chunks
# Note: chunk_size and chunk_overlap are counted in characters here, not tokens.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")

# 3. Create vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="./chroma_db"
)

# 4. Build RAG chain
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

template = """Answer the question based only on the following context.
If you cannot answer from the context, say so.

Context:
{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(model="gpt-4", temperature=0)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# 5. Query
response = rag_chain.invoke("What are the main benefits of RAG?")
print(response)

Code Example: RAG in Node.js

Here's the equivalent using LangChain.js with an in-memory vector store:

import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { createRetrievalChain } from "langchain/chains/retrieval";
import { Document } from "@langchain/core/documents";
import { readFileSync, readdirSync } from "fs";

// 1. Load documents
const docsDir = "./docs";
const files = readdirSync(docsDir).filter((f) => f.endsWith(".txt"));
const documents = files.map(
  (f) =>
    new Document({
      pageContent: readFileSync(`${docsDir}/${f}`, "utf-8"),
      metadata: { source: f },
    })
);

// 2. Split into chunks
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 512,
  chunkOverlap: 50,
});
const chunks = await splitter.splitDocuments(documents);

// 3. Create vector store
const embeddings = new OpenAIEmbeddings({
  model: "text-embedding-3-small",
});
const vectorStore = await MemoryVectorStore.fromDocuments(chunks, embeddings);

// 4. Build RAG chain
const retriever = vectorStore.asRetriever({ k: 5 });

const prompt = ChatPromptTemplate.fromTemplate(`
Answer the question based only on the following context.
If you cannot answer from the context, say so.

Context: {context}

Question: {input}
`);

const llm = new ChatOpenAI({ model: "gpt-4", temperature: 0 });

const combineDocsChain = await createStuffDocumentsChain({
  llm,
  prompt,
});

const retrievalChain = await createRetrievalChain({
  retriever,
  combineDocsChain,
});

// 5. Query
const response = await retrievalChain.invoke({
  input: "What are the main benefits of RAG?",
});
console.log(response.answer);

RAG vs Standard LLM Completion

Aspect	Standard LLM	RAG
Knowledge	Frozen at training time	Fresh from external sources
Accuracy	Prone to hallucination	Grounded in retrieved evidence
Citations	Cannot cite sources	Can reference specific documents
Domain specificity	General knowledge only	Access to your private data
Cost to update	Retrain or fine-tune	Update the document index
Latency	Single LLM call	Retrieval + LLM call
Complexity	Simple API call	Indexing pipeline + retrieval + generation

Common Beginner Mistakes

Using Chunks That Are Too Large or Too Small

Chunks that are too large (2000+ tokens) dilute the relevant signal with noise. Chunks that are too small (under 100 tokens) lose context. Start with 300–600 tokens and tune based on retrieval quality.

Ignoring Chunk Overlap

Without overlap, information that spans a chunk boundary gets split and may never be retrieved together. An overlap of roughly 10–15% (e.g., 50 tokens for a 512-token chunk) keeps most boundary-spanning passages intact in at least one chunk.

Using Different Embedding Models for Indexing and Querying

The embedding model used to index documents must be the same model used to embed queries. Mixing models produces incompatible vector spaces and garbage retrieval results.
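One way to catch this mistake early is to record the model name alongside the index and check it on every write and query. The sketch below uses a hypothetical in-memory VectorIndex class, not any particular vector database's API. Note that it checks the model name as well as the vector length, since two different models can share a dimension count and still produce incompatible spaces.

```python
from dataclasses import dataclass, field

@dataclass
class VectorIndex:
    """Hypothetical in-memory index that remembers which model built it."""
    model_name: str
    dimensions: int
    vectors: list = field(default_factory=list)

    def _check(self, vector, model_name):
        if model_name != self.model_name:
            raise ValueError(
                f"index was built with {self.model_name!r}, got vector from {model_name!r}"
            )
        if len(vector) != self.dimensions:
            raise ValueError(
                f"expected {self.dimensions} dimensions, got {len(vector)}"
            )

    def add(self, vector, model_name):
        """Store a document vector after confirming it came from the index's model."""
        self._check(vector, model_name)
        self.vectors.append(vector)

    def check_query(self, vector, model_name):
        """Validate a query vector before running a similarity search."""
        self._check(vector, model_name)

index = VectorIndex(model_name="text-embedding-3-small", dimensions=1536)
index.add([0.0] * 1536, "text-embedding-3-small")

try:
    # A 384-dimensional query vector from a different model
    index.check_query([0.0] * 384, "all-MiniLM-L6-v2")
    mismatch_caught = False
except ValueError as err:
    mismatch_caught = True
    print(f"Rejected query: {err}")
```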

Skipping Evaluation

Many developers build a RAG pipeline and ship it without measuring retrieval quality. You should track metrics like recall@K (are the relevant documents in the top-K results?) and answer faithfulness (does the response match the retrieved context?). See Building a Production RAG System for evaluation frameworks.
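Recall@K is simple enough to compute by hand once you have a small labeled set of queries. The sketch below assumes each query comes with a set of document IDs a human has marked as relevant; the IDs shown are made up for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)

# Hypothetical retriever output (best match first) and human relevance labels
retrieved = ["doc7", "doc2", "doc9", "doc4", "doc1"]
relevant = {"doc2", "doc4", "doc8"}

r3 = recall_at_k(retrieved, relevant, k=3)
r5 = recall_at_k(retrieved, relevant, k=5)
print(f"recall@3 = {r3:.2f}")
print(f"recall@5 = {r5:.2f}")
```

Here one of the three relevant documents appears in the top 3 and two appear in the top 5, while doc8 is never retrieved. Averaging this number over a set of real queries gives a baseline to compare against when you change chunk size, K, or the embedding model.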

Stuffing Too Many Documents Into Context

Retrieving more documents isn't always better. Liu et al. (2023) found that LLMs often struggle to use information placed in the middle of long contexts (the "lost in the middle" effect). Retrieve fewer, higher-quality chunks rather than flooding the context window.

Not Testing With Real Queries

Developers often build and evaluate RAG pipelines with synthetic or trivially simple queries. Test with the actual questions your users will ask — including ambiguous, multi-part, and adversarial queries. The gap between demo queries and real queries is where most RAG systems fail.

Ignoring Metadata

Many RAG implementations throw away valuable metadata during chunking — source document title, section heading, creation date, author, document type. This metadata is crucial for filtering at retrieval time (e.g., "only search in the engineering docs from the last year") and for citations in the generated response. Always preserve and index metadata alongside your chunks.
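As a rough sketch of what preserved metadata buys you, the example below filters chunks by document type and creation date before any similarity search, then attaches the source name for citations. The field names (source, doc_type, created) and the sample chunks are illustrative assumptions, not a required schema. Most vector databases offer their own metadata filter syntax that would replace the filter_chunks helper.

```python
from datetime import date

# Hypothetical chunks with metadata kept from the ingestion step
chunks = [
    {"text": "Deploys run through the staging cluster first.",
     "metadata": {"source": "deploy-guide.md", "doc_type": "engineering", "created": date(2024, 9, 1)}},
    {"text": "Legacy deploys used manual FTP uploads.",
     "metadata": {"source": "old-deploy.md", "doc_type": "engineering", "created": date(2019, 3, 15)}},
    {"text": "Expense reports are due on the 5th.",
     "metadata": {"source": "finance-faq.md", "doc_type": "finance", "created": date(2024, 10, 2)}},
]

def filter_chunks(chunks, doc_type=None, created_after=None):
    """Keep only chunks whose metadata passes every filter that was supplied."""
    kept = []
    for chunk in chunks:
        meta = chunk["metadata"]
        if doc_type is not None and meta["doc_type"] != doc_type:
            continue
        if created_after is not None and meta["created"] <= created_after:
            continue
        kept.append(chunk)
    return kept

def with_citation(chunk):
    """Append the source name so the generator can cite it."""
    return f'{chunk["text"]} [source: {chunk["metadata"]["source"]}]'

candidates = filter_chunks(chunks, doc_type="engineering", created_after=date(2024, 1, 1))
context_lines = [with_citation(c) for c in candidates]
print(context_lines)
```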

What's Next

Now that you understand the fundamentals, dive deeper into the technical architecture — embeddings models, vector database selection, and retrieval strategies — in RAG Architecture Deep Dive: Embeddings, Vector Databases, and Retrieval.

For the broader context of where RAG fits in the AI landscape, see The Complete Guide to RAG.

References

Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
LangChain Documentation. RAG Tutorial.
LangChain.js Documentation. RAG Quickstart.
Liu, N., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv.
OpenAI. Embeddings Guide.

FAQ

How many documents should I retrieve (what's the right K)?

Start with K = 3–5 and evaluate. Smaller K gives more focused context but risks missing relevant information. Larger K provides more coverage but can introduce noise and hit context window limits. The optimal K depends on your chunk size, context window, and use case.
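One way to keep K from overrunning the context window is to cap it with a token budget. The sketch below is an assumption-laden illustration: select_within_budget is a hypothetical helper, it expects chunks already sorted best-first, and it estimates tokens by counting whitespace-separated words rather than using a real tokenizer.

```python
def select_within_budget(ranked_chunks, max_k=5, token_budget=1500):
    """Take up to max_k chunks, in rank order, stopping before the budget is exceeded."""
    selected = []
    used = 0
    for chunk in ranked_chunks[:max_k]:
        cost = len(chunk.split())  # crude token estimate: one token per word
        if used + cost > token_budget:
            break
        selected.append(chunk)
        used += cost
    return selected

# Five synthetic chunks of 400 words each, already ranked best-first
ranked = [" ".join(["word"] * 400) for _ in range(5)]
selected = select_within_budget(ranked, max_k=5, token_budget=1000)
print(len(selected))
```

With a 1000-word budget only the first two 400-word chunks fit, so the effective K is 2 even though max_k is 5.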

Can I use RAG without an API like OpenAI?

Yes. You can run the entire RAG stack locally using open-source models. Use sentence-transformers for embeddings, Chroma or FAISS for vector storage, and Ollama or vLLM to serve open-source LLMs like Llama 3 or Mistral.

What file types can RAG handle?

RAG can work with any text-based content. Document loaders exist for PDFs, Word docs, HTML, Markdown, CSV, JSON, code files, and more. The key is converting the content to clean text before chunking and embedding.

How do I handle documents that change frequently?

Implement an incremental indexing pipeline that detects changes and re-embeds only modified documents. Most vector databases support upsert operations that update existing records. For rapidly changing data (e.g., stock prices), consider a hybrid approach with real-time API lookups.
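Change detection can be as simple as comparing content hashes. The sketch below assumes you keep a mapping of document ID to the hash of the text you last indexed; plan_reindex is a hypothetical helper that only decides what to do. The actual re-embedding and upsert or delete calls depend on your vector database.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(current_docs, stored_hashes):
    """Compare current documents against the hashes saved at last indexing.

    Returns (to_upsert, to_delete): IDs that are new or modified, and IDs
    that no longer exist in the source.
    """
    to_upsert = sorted(
        doc_id for doc_id, text in current_docs.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(doc_id for doc_id in stored_hashes if doc_id not in current_docs)
    return to_upsert, to_delete

# Hashes recorded during the previous indexing run
stored_hashes = {
    "a.txt": content_hash("alpha"),
    "b.txt": content_hash("beta"),
    "c.txt": content_hash("gamma"),
}
# What the source folder contains now
current_docs = {
    "a.txt": "alpha",          # unchanged
    "b.txt": "beta, revised",  # modified
    "d.txt": "delta",          # new
}

to_upsert, to_delete = plan_reindex(current_docs, stored_hashes)
print("re-embed and upsert:", to_upsert)
print("delete from index:", to_delete)
```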

What's the difference between RAG and semantic search?

Semantic search finds relevant documents using embeddings and vector similarity — it's the retrieval part of RAG. RAG adds a generation step: feeding those documents to an LLM to produce a synthesized natural language answer, rather than just returning a list of search results.