Building a Production-Ready RAG System: From Prototype to Deployment
Most RAG tutorials end where the real work begins. They show you how to load a PDF, split it into chunks, embed it, and query it — and that's fine for a demo. But the distance between that demo and a system that handles thousands of users, diverse document formats, stale data, adversarial queries, and evolving requirements is enormous.
This guide covers what it actually takes to build a production RAG system: the tech stack decisions, ingestion pipelines, chunking strategies, evaluation frameworks, and deployment architecture you need to get right before going live. If you're still getting oriented with RAG concepts, start with the complete guide to RAG. For real-world examples of these patterns in action, see the RAG case studies.
Tech Stack Overview
A production RAG system has five core layers. Your choices at each layer shape the system's capabilities, cost profile, and operational complexity.
| Layer | Options | Considerations |
|---|---|---|
| Orchestration | LangChain, LlamaIndex, Haystack, custom | LangChain for flexibility, LlamaIndex for document-heavy workloads, custom for maximum control |
| Embedding models | OpenAI text-embedding-3-small/large, Cohere Embed v3, open-source (BGE, E5) | Cost vs. quality vs. data privacy. Open-source models avoid sending data to external APIs |
| Vector database | Pinecone, Weaviate, Qdrant, pgvector, Milvus, ChromaDB | Managed vs. self-hosted, scale requirements, metadata filtering capabilities |
| LLM | GPT-4o, Claude 3.5 Sonnet, Llama 3.1, Mixtral | Latency, cost, context window size, instruction following quality |
| Infrastructure | AWS, GCP, Azure, hybrid | Existing cloud commitments, data residency requirements, GPU availability |
Don't over-optimize tech stack decisions upfront. Start with managed services (OpenAI embeddings, Pinecone, GPT-4o) to validate the product. Migrate to self-hosted components when cost or data privacy demands it.
Data Ingestion Pipeline
The ingestion pipeline transforms raw documents into indexed, searchable chunks. This is where most production RAG complexity lives.
Document Loaders
Real-world knowledge bases span dozens of formats. Your ingestion pipeline needs to handle all of them reliably.
from datetime import datetime, timezone
from pathlib import Path

from langchain_community.document_loaders import (
    CSVLoader,
    PyPDFLoader,
    TextLoader,
    UnstructuredHTMLLoader,
    UnstructuredMarkdownLoader,
)

# Map file extensions to the loader class that handles them.
LOADER_MAP = {
    ".pdf": PyPDFLoader,
    ".html": UnstructuredHTMLLoader,
    ".htm": UnstructuredHTMLLoader,
    ".csv": CSVLoader,
    ".txt": TextLoader,
    ".md": UnstructuredMarkdownLoader,
}


def load_document(file_path: str) -> list:
    """Load one file and tag every resulting document with provenance metadata."""
    ext = Path(file_path).suffix.lower()
    loader_cls = LOADER_MAP.get(ext)
    if not loader_cls:
        raise ValueError(f"Unsupported file type: {ext}")
    loader = loader_cls(file_path)
    docs = loader.load()
    # Timezone-aware timestamp; datetime.utcnow() is deprecated since Python 3.12.
    ingested_at = datetime.now(timezone.utc).isoformat()
    for doc in docs:
        doc.metadata.update({
            "source": file_path,
            "file_type": ext,
            "ingested_at": ingested_at,
        })
    return docs
Text Extraction and Preprocessing
Raw document loading is just the first step. Production pipelines need preprocessing to clean extracted text:
- Whitespace normalization: PDFs often produce inconsistent spacing, repeated newlines, and Unicode artifacts.
- Header/footer removal: recurring page headers and footers in PDFs add noise to every chunk.
- Table extraction: embedded tables need structure-preserving extraction, not linearized text.
- Encoding handling: legacy documents in non-UTF-8 encodings require detection and conversion.
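import re
import unicodedata


def preprocess_text(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)   # fold Unicode compatibility forms
    text = re.sub(r"\n{3,}", "\n\n", text)       # collapse runs of blank lines
    text = re.sub(r"[ \t]+", " ", text)          # collapse repeated spaces and tabs
    text = re.sub(r"Page \d+ of \d+", "", text)  # strip a common page-footer pattern
    text = text.strip()
    return text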
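The regex above only catches one footer pattern. A more general take on the header/footer removal listed earlier is to drop lines that recur on most pages. The sketch below assumes you already have the text of each page as a separate string; `strip_repeated_lines` and its `min_fraction` threshold are illustrative names, not part of any library.

```python
from collections import Counter


def strip_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    """Drop lines that appear on at least `min_fraction` of pages (likely headers/footers)."""
    if len(pages) < 2:
        # With a single page, every line would count as "repeated"; leave it alone.
        return list(pages)
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = min_fraction * len(pages)
    repeated = {line for line, n in counts.items() if n >= threshold}
    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "ACME Handbook\nVacation policy details.\nConfidential",
    "ACME Handbook\nExpense policy details.\nConfidential",
    "ACME Handbook\nRemote work details.\nConfidential",
]
cleaned_pages = strip_repeated_lines(pages)
```

Footers that change per page (such as page counters) will not match exactly, so a pattern-based rule is still useful alongside this one.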
Metadata Enrichment
Metadata attached to chunks at ingestion time powers filtering, access control, and retrieval quality at query time. Invest in rich metadata upfront.
Essential metadata fields:
- Source: file path, URL, or database reference
- Document type: policy, API doc, tutorial, FAQ
- Created/updated timestamps: for freshness-weighted retrieval
- Author/owner: for access control and attribution
- Section hierarchy: chapter → section → subsection for context
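One way to keep these fields consistent is to define them once as a typed record. The following is a minimal sketch; the `ChunkMetadata` class, its field names, and the `passes_filter` helper are assumptions made for illustration, and in practice the filter would usually be pushed down to the vector database's metadata filtering.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ChunkMetadata:
    source: str                 # file path, URL, or database reference
    doc_type: str               # e.g. "policy", "api_doc", "tutorial", "faq"
    updated_at: datetime        # used for freshness-weighted retrieval
    owner: str                  # used for access control and attribution
    section_path: list[str] = field(default_factory=list)  # chapter, section, subsection

    def breadcrumb(self) -> str:
        """Render the section hierarchy as a single context string."""
        return " > ".join(self.section_path)


def passes_filter(meta: ChunkMetadata, doc_types: set[str], updated_after: datetime) -> bool:
    """Keep a chunk only if it has an allowed type and is fresh enough."""
    return meta.doc_type in doc_types and meta.updated_at >= updated_after


meta = ChunkMetadata(
    source="docs/security/key-rotation.md",
    doc_type="policy",
    updated_at=datetime(2024, 6, 1, tzinfo=timezone.utc),
    owner="security-team",
    section_path=["Security", "Credentials", "Key rotation"],
)
cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)
```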
Chunking Strategies
Chunking is the single highest-leverage decision in a RAG pipeline. Bad chunking produces irrelevant retrievals regardless of how good your embedding model or vector database is.
Fixed-Size Chunking
The simplest approach: split text into chunks of N characters (or tokens) with an overlap window.
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=1000,      # measured in characters by default
    chunk_overlap=200,
    separator="\n\n",
)
chunks = splitter.split_documents(documents)
Pros: simple, predictable chunk sizes, easy to reason about token budgets. Cons: splits mid-sentence, mid-paragraph, or mid-thought. A chunk boundary in the middle of a code example or multi-step explanation degrades retrieval quality.
Recursive Character Text Splitting
Splits hierarchically: first by double newline (paragraphs), then single newline, then sentence boundaries, then spaces, and finally individual characters. This preserves semantic boundaries better than fixed-size splitting.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],  # tried in order, coarsest first
)
chunks = splitter.split_documents(documents)
This is the default recommendation for most use cases. It balances semantic preservation with predictable sizing.
Semantic Chunking
Groups sentences by semantic similarity rather than character count. Adjacent sentences with high embedding similarity stay together; a drop in similarity triggers a chunk boundary.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=75,
)
chunks = splitter.split_documents(documents)
Pros: chunks align with topic shifts, producing more coherent retrieval units. Cons: requires embedding each sentence during ingestion (costly at scale), chunk sizes are unpredictable, and the quality depends on the embedding model's ability to capture semantic boundaries.
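To make the breakpoint idea concrete, here is a simplified, dependency-free sketch of percentile-based splitting. It is not the library's implementation: the toy 2-D vectors stand in for real sentence embeddings, and `semantic_split` and `cosine_distance` are hypothetical helper names.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def semantic_split(sentences: list[str], vectors: list[list[float]],
                   percentile: float = 75.0) -> list[str]:
    """Start a new chunk wherever the distance between neighbours exceeds the percentile."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    distances = [cosine_distance(vectors[i], vectors[i + 1])
                 for i in range(len(vectors) - 1)]
    ranked = sorted(distances)
    # Nearest-rank percentile over the adjacent-sentence distances.
    idx = min(len(ranked) - 1, max(0, math.ceil(percentile / 100 * len(ranked)) - 1))
    threshold = ranked[idx]
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # similarity dropped sharply: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "OAuth2 uses access tokens.",
    "Tokens expire after an hour.",
    "Refresh tokens renew access.",
    "Billing runs monthly.",
    "Invoices are emailed.",
]
# Toy embeddings: the first three point one way, the last two another.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
chunks = semantic_split(sentences, vectors, percentile=75.0)
```

With these inputs the only distance above the 75th percentile falls between the third and fourth sentences, so the text splits into an OAuth chunk and a billing chunk.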
Parent-Child Document Chunking
Indexes small chunks for precise retrieval but returns the parent document (or a larger surrounding window) for generation context. This solves the fundamental tension between retrieval precision (small chunks match better) and generation quality (larger context produces better answers).
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
store = InMemoryStore()

# `vectorstore` is assumed to be an already-initialized vector store instance.
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,          # indexes the small child chunks
    docstore=store,                   # holds the larger parent chunks
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(documents)
For production use, replace InMemoryStore with a persistent store (Redis, PostgreSQL) so parent documents survive restarts.
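The mechanics behind the parent-child pattern are easy to see without any framework. The sketch below is a toy illustration under loud assumptions: fixed-size splitting replaces the recursive splitter, and a substring match stands in for vector search. All function names are hypothetical.

```python
def split_fixed(text: str, size: int) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_parent_child_index(text: str, parent_size: int = 200, child_size: int = 50):
    """Return (parents by id, list of (child_text, parent_id)) for one document."""
    parents: dict[int, str] = {}
    children: list[tuple[str, int]] = []
    for parent_id, parent in enumerate(split_fixed(text, parent_size)):
        parents[parent_id] = parent
        for child in split_fixed(parent, child_size):
            children.append((child, parent_id))
    return parents, children


def retrieve_parents(term: str, parents: dict[int, str],
                     children: list[tuple[str, int]]) -> list[str]:
    """Match against small children, but return each matching parent once."""
    seen: set[int] = set()
    results: list[str] = []
    for child, parent_id in children:
        if term.lower() in child.lower() and parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results


text = "Intro. " * 20 + "Rotate API keys every 90 days. " + "Outro. " * 20
parents, children = build_parent_child_index(text)
hits = retrieve_parents("90 days", parents, children)
```

The match happens on a 50-character child, but the caller receives the 200-character parent, which contains the whole sentence the child was cut from.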
Embedding Pipeline
Model Selection
Your embedding model determines the quality ceiling of your retrieval. The right choice depends on your domain, languages, and cost constraints.
| Model | Dimensions | Strengths | Considerations |
|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | Good general-purpose, low cost | Requires API calls, data leaves your infra |
| text-embedding-3-large (OpenAI) | 3072 | Higher quality, supports dimension reduction | Several times the per-token price of small (about 6.5x at OpenAI's launch list prices), same data privacy concern |
| Cohere Embed v3 | 1024 | Strong multilingual, input type parameter | API dependency |
| BGE-large-en-v1.5 | 1024 | Open-source, self-hostable | Requires GPU for reasonable throughput |
| E5-mistral-7b | 4096 | Highest quality open-source | Large model, significant GPU requirements |
Batch Processing
Embedding thousands of documents one at a time is slow and wasteful. Batch your embedding calls.
from langchain_openai import OpenAIEmbeddings
import asyncio

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    chunk_size=1000,  # batch size for API calls
)


async def embed_documents_batched(texts: list[str], batch_size: int = 500) -> list:
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        batch_embeddings = await embeddings.aembed_documents(batch)
        all_embeddings.extend(batch_embeddings)
    return all_embeddings
Caching Embeddings
Recomputing embeddings for unchanged documents is pure waste. Implement content-based caching:
import hashlib
import json

from langchain_openai import OpenAIEmbeddings


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()


class EmbeddingCache:
    def __init__(self, cache_store):
        # cache_store: any object with get(key) and set(key, value), e.g. a Redis client.
        self.cache = cache_store
        self.model = OpenAIEmbeddings(model="text-embedding-3-small")

    async def get_or_compute(self, texts: list[str]) -> list:
        results = [None] * len(texts)
        to_compute = []
        for i, text in enumerate(texts):
            h = content_hash(text)
            cached = self.cache.get(h)
            if cached is not None:
                results[i] = json.loads(cached)
            else:
                to_compute.append((i, text, h))
        if to_compute:
            # Embed only the cache misses, in a single batched call.
            new_texts = [t for _, t, _ in to_compute]
            new_embeddings = await self.model.aembed_documents(new_texts)
            for (i, _, h), emb in zip(to_compute, new_embeddings):
                results[i] = emb
                self.cache.set(h, json.dumps(emb))
        return results
Retrieval Optimization
Hybrid Search (Dense + Sparse)
Pure vector search misses exact keyword matches. Pure BM25 misses semantic similarity. Combine them.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(documents, k=10)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.3, 0.7],  # same order as `retrievers`
)
results = hybrid_retriever.invoke("How to configure OAuth2 in the API gateway")
The weight split (30% BM25, 70% dense) is a starting point. Tune it on your evaluation dataset. Technical documentation and code-heavy domains often benefit from higher BM25 weights (40–50%).
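To build intuition for what the weights do, here is a minimal sketch of weighted Reciprocal Rank Fusion, a common way to merge ranked lists from different retrievers. It is a standalone illustration rather than LangChain's code; the constant `c=60` is a conventional default, and the document IDs are made up.

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists: each list contributes weight / (rank + c) per document."""
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_ranking = ["oauth-setup", "gateway-faq", "changelog"]
dense_ranking = ["gateway-overview", "oauth-setup", "gateway-faq"]
fused = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.3, 0.7])
```

A document that appears in both lists ("oauth-setup") outranks one that tops only a single list, which is the behavior hybrid search relies on. Raising the BM25 weight shifts the balance toward exact keyword matches.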
Query Transformation
User queries are often vague, incomplete, or poorly phrased for retrieval. Transform them before searching.
HyDE (Hypothetical Document Embeddings): generate a hypothetical answer, then embed that instead of the query. The hypothetical answer is closer in embedding space to real documents than the short query.
from langchain.chains import HypotheticalDocumentEmbedder
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    base_embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
    prompt_key="web_search",  # selects one of the built-in HyDE prompt templates
)
Multi-query retrieval: generate multiple query variations and merge results. This improves recall for ambiguous queries.
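The merge step can be sketched independently of any LLM. In the example below, `rewrite` and `search` are injected callables so the logic stays testable; the stub implementations and the `kb-*` IDs are hypothetical placeholders for an LLM-based query rewriter and a real retriever.

```python
from typing import Callable


def multi_query_retrieve(
    query: str,
    rewrite: Callable[[str], list[str]],
    search: Callable[[str], list[str]],
    k: int = 5,
) -> list[str]:
    """Search the original query plus its variations; dedupe in first-seen order."""
    seen: set[str] = set()
    merged: list[str] = []
    for q in [query, *rewrite(query)]:
        for doc_id in search(q):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged[:k]


# Stubs standing in for an LLM rewriter and a vector search.
FAKE_INDEX = {
    "reset password": ["kb-12", "kb-40"],
    "how do I recover my account": ["kb-40", "kb-7"],
    "forgot login credentials": ["kb-3", "kb-12"],
}


def fake_rewrite(query: str) -> list[str]:
    return ["how do I recover my account", "forgot login credentials"]


def fake_search(query: str) -> list[str]:
    return FAKE_INDEX.get(query, [])


merged = multi_query_retrieve("reset password", fake_rewrite, fake_search, k=4)
```

The variations surface `kb-7` and `kb-3`, which the original query alone would have missed.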
Contextual Compression
Retrieved chunks often contain irrelevant surrounding text. A contextual compressor extracts only the parts relevant to the query, improving the signal-to-noise ratio in the generation prompt.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

compressor = LLMChainExtractor.from_llm(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0)
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=hybrid_retriever,  # the hybrid retriever defined earlier
)
This adds an LLM call per retrieval, so use it selectively — on the top 5 results after reranking, not on 50 raw retrieval results.
Evaluation Framework
You can't improve what you can't measure. Production RAG systems need continuous evaluation across multiple dimensions.
Core Metrics
| Metric | What It Measures | How to Compute |
|---|---|---|
| Faithfulness | Is the response grounded in retrieved context? | LLM-as-judge: check each claim against source chunks |
| Answer relevance | Does the response actually answer the question? | LLM-as-judge: score answer against original query |
| Context precision | Are the retrieved chunks relevant to the query? | Ratio of relevant chunks in top-k results |
| Context recall | Did retrieval find all relevant information? | Compare retrieved info against ground truth answers |
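The two retrieval metrics in the table reduce to simple set arithmetic once you have relevance labels. Below is a minimal sketch of the set-based versions; RAGAS computes its own LLM-judged variants, so treat this as the plain definition rather than that library's formula. The chunk IDs and the helper name are illustrative.

```python
def precision_recall_at_k(retrieved_ids: list[str], relevant_ids: set[str],
                          k: int) -> tuple[float, float]:
    """Precision: share of top-k that is relevant. Recall: share of relevant found."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


retrieved = ["c1", "c7", "c3", "c9", "c4"]
relevant = {"c1", "c3", "c5"}
precision, recall = precision_recall_at_k(retrieved, relevant, k=5)
```

Here two of the five retrieved chunks are relevant (precision 0.4), and two of the three relevant chunks were found (recall of about 0.67).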
RAGAS Framework
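RAGAS provides automated evaluation for RAG pipelines. Its LLM-judged metrics, such as faithfulness and answer relevancy, need no human-labeled ground truth. Context recall is scored against reference answers, which is why the dataset below includes a ground_truth column.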
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

eval_dataset = Dataset.from_dict({
    "question": questions,
    "answer": generated_answers,
    "contexts": retrieved_contexts,
    "ground_truth": reference_answers,
})

results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
# Example output:
# {'faithfulness': 0.87, 'answer_relevancy': 0.91,
#  'context_precision': 0.78, 'context_recall': 0.82}
Evaluation Script
Here's a complete evaluation pipeline that you can run against your RAG system:
import json
import time
from dataclasses import asdict, dataclass

from langchain_openai import ChatOpenAI


@dataclass
class EvalResult:
    question: str
    generated_answer: str
    retrieved_chunks: list[str]
    faithfulness_score: float
    relevance_score: float
    retrieval_precision: float
    latency_ms: float


def evaluate_faithfulness(answer: str, contexts: list[str], llm) -> float:
    """Ask the judge LLM for a 0.0-1.0 support score; return 0.0 if it is unparseable."""
    context_text = "\n---\n".join(contexts)
    prompt = f"""Given the following context and answer, score how well the answer
is supported by the context. Return a score from 0.0 to 1.0.

Context:
{context_text}

Answer:
{answer}

Score (0.0-1.0):"""
    response = llm.invoke(prompt)
    try:
        score = float(response.content.strip())
    except ValueError:
        return 0.0
    # Clamp in case the judge returns a value outside the requested range.
    return max(0.0, min(1.0, score))


def run_evaluation(rag_chain, eval_questions: list[dict], output_path: str):
    # Assumes rag_chain.invoke() returns a dict with "answer" and "contexts" keys.
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    results = []
    for item in eval_questions:
        start = time.perf_counter()  # monotonic clock, safer than datetime.now() for timing
        response = rag_chain.invoke(item["question"])
        latency = (time.perf_counter() - start) * 1000
        faith_score = evaluate_faithfulness(
            response["answer"], response["contexts"], llm
        )
        results.append(EvalResult(
            question=item["question"],
            generated_answer=response["answer"],
            retrieved_chunks=response["contexts"],
            faithfulness_score=faith_score,
            relevance_score=0.0,  # compute similarly
            retrieval_precision=0.0,  # requires ground truth
            latency_ms=latency,
        ))

    with open(output_path, "w") as f:
        json.dump([asdict(r) for r in results], f, indent=2)

    if not results:  # avoid dividing by zero on an empty question set
        print("No evaluation questions provided.")
        return

    avg_faith = sum(r.faithfulness_score for r in results) / len(results)
    avg_latency = sum(r.latency_ms for r in results) / len(results)
    print(f"Average faithfulness: {avg_faith:.2f}")
    print(f"Average latency: {avg_latency:.0f}ms")
    print(f"Results written to {output_path}")
Human Evaluation
Automated metrics get you 80% of the way. The remaining 20% requires human review, especially for subjective quality dimensions like "helpfulness" and "completeness." Build a lightweight human evaluation workflow:
- Sample 50–100 queries per week from production traffic.
- Have domain experts rate responses on a 1–5 scale for relevance, accuracy, and completeness.
- Track scores over time to detect regressions.
- Feed low-scoring examples back into your evaluation dataset.
Deployment Architecture
Infrastructure
A production RAG system has multiple services that need to scale independently.
            ┌─────────────┐
            │    Load     │
            │  Balancer   │
            └──────┬──────┘
                   │
            ┌──────┴──────┐
            │     API     │
            │   Gateway   │
            └──────┬──────┘
                   │
      ┌────────────┼────────────┐
      │            │            │
┌─────┴─────┐  ┌───┴───┐  ┌─────┴─────┐
│   Query   │  │ Embed │  │ Ingestion │
│  Service  │  │  Svc  │  │  Workers  │
└─────┬─────┘  └───┬───┘  └─────┬─────┘
      │            │            │
┌─────┴────────────┴────────────┴─────┐
│           Vector Database           │
│   (Pinecone / Qdrant / Weaviate)    │
└─────────────────────────────────────┘
Query Service handles incoming queries, orchestrates retrieval, and calls the LLM. Scale horizontally based on QPS.
Embedding Service wraps the embedding model. If using an API (OpenAI), this is a thin proxy with rate limiting and retry logic. If self-hosting, this runs the model on GPU instances.
Ingestion Workers process new documents asynchronously. These are triggered by file uploads, webhook events, or scheduled crawls. Scale based on ingestion backlog.
Monitoring and Observability
You need visibility into every stage of the pipeline:
- Retrieval quality: log queries, retrieved chunk IDs, and relevance scores. Dashboards showing retrieval precision over time catch silent degradation.
- Generation quality: log prompts, responses, and automated faithfulness scores. Alert on faithfulness drops below your threshold.
- Latency breakdown: instrument each stage (embedding, retrieval, reranking, generation) independently. A P95 latency spike in reranking is invisible if you only measure end-to-end latency.
- Cost tracking: log token usage per query. A sudden cost increase might indicate a prompt engineering regression or an agent loop issue (see Agentic RAG for loop risks).
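Per-stage latency instrumentation needs very little code. The sketch below uses only the standard library; `StageTimer` is a hypothetical helper, and the `time.sleep` calls stand in for real retrieval and generation work. In production you would more likely export these samples to your metrics system than keep them in memory.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Collects per-stage latencies in milliseconds."""

    def __init__(self):
        self.samples = defaultdict(list)

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised an exception.
            self.samples[name].append((time.perf_counter() - start) * 1000)

    def p95(self, name: str) -> float:
        values = sorted(self.samples.get(name, []))
        if not values:
            return 0.0
        # Nearest-rank percentile: smallest sample with >= 95% of samples at or below it.
        return values[math.ceil(0.95 * len(values)) - 1]


timer = StageTimer()
for _ in range(3):
    with timer.stage("retrieval"):
        time.sleep(0.002)  # stand-in for a vector search call
    with timer.stage("generation"):
        time.sleep(0.005)  # stand-in for an LLM call

report = {name: round(timer.p95(name), 1) for name in timer.samples}
```

Because each stage is recorded under its own name, a slow reranker or embedding call shows up directly instead of being averaged into the end-to-end number.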
Cost Optimization
RAG costs come from three sources: embedding computation, vector database operations, and LLM inference.
| Cost Driver | Optimization |
|---|---|
| Embedding API calls | Cache embeddings by content hash; batch operations; use smaller models where quality is sufficient |
| Vector DB storage | Reduce dimensions (Matryoshka embeddings); quantize vectors; prune stale documents |
| LLM inference | Use smaller models for simple queries (GPT-4o-mini); cache frequent query-response pairs; truncate context to relevant chunks only |
| Reranking | Only rerank top-N candidates, not full result set; use lighter rerankers for simple queries |
A typical production RAG query costs $0.01–0.05 with managed services. At 100K queries/month, that's $1,000–5,000/month — significant enough to warrant optimization but reasonable for most B2B use cases.
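The monthly figure is straightforward multiplication, which makes it easy to rerun for your own traffic. This is a small sketch using the per-query range quoted above; the function name is illustrative.

```python
def monthly_cost_range(queries_per_month: int, low_per_query: float,
                       high_per_query: float) -> tuple[float, float]:
    """Return the (low, high) monthly spend in dollars for a per-query cost range."""
    return (
        round(queries_per_month * low_per_query, 2),
        round(queries_per_month * high_per_query, 2),
    )


low, high = monthly_cost_range(100_000, 0.01, 0.05)
```

For 100,000 queries this gives a range of 1,000 to 5,000 dollars per month. Substituting measured per-query costs from your token logs gives a more useful estimate than the generic range.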
Scaling Strategies
Read scaling: vector databases handle read scaling well. Pinecone and Weaviate scale horizontally with replicas. Qdrant supports distributed deployments.
Write scaling: ingestion is typically the bottleneck. Use message queues (SQS, RabbitMQ) to buffer ingestion requests and process them with auto-scaling workers.
LLM scaling: the LLM is usually the latency bottleneck. Strategies include request batching, serving self-hosted models with vLLM (which provides continuous batching and prefix caching), and routing simple queries to faster, cheaper models.
Complete Ingestion Pipeline
Here's a production-grade ingestion pipeline that ties together document loading, preprocessing, chunking, embedding, and indexing:
import asyncio
import logging
import re
import unicodedata
from datetime import datetime, timezone
from pathlib import Path

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    UnstructuredHTMLLoader,
    UnstructuredMarkdownLoader,
)
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings
from qdrant_client import QdrantClient

logger = logging.getLogger(__name__)

# Extension -> loader class. File types not listed here are skipped.
LOADER_MAP = {
    ".pdf": PyPDFLoader,
    ".html": UnstructuredHTMLLoader,
    ".md": UnstructuredMarkdownLoader,
    ".txt": TextLoader,
}


class ProductionIngestionPipeline:
    def __init__(self, qdrant_url: str, collection_name: str):
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n\n", "\n", ". ", " ", ""],
        )
        self.embeddings = OpenAIEmbeddings(
            model="text-embedding-3-small",
            chunk_size=500,  # batch size for embedding API calls
        )
        self.client = QdrantClient(url=qdrant_url)
        self.collection_name = collection_name
        # The collection is expected to exist already with a matching vector size.
        self.vectorstore = Qdrant(
            client=self.client,
            collection_name=collection_name,
            embeddings=self.embeddings,
        )

    def load_document(self, file_path: str) -> list:
        ext = Path(file_path).suffix.lower()
        loader_cls = LOADER_MAP.get(ext)
        if loader_cls is None:
            logger.warning(f"Skipping unsupported file type: {ext}")
            return []
        return loader_cls(file_path).load()

    def preprocess(self, docs: list) -> list:
        for doc in docs:
            text = unicodedata.normalize("NFKC", doc.page_content)
            text = re.sub(r"\n{3,}", "\n\n", text)
            text = re.sub(r"[ \t]+", " ", text)
            doc.page_content = text.strip()
        # Drop near-empty pages (50 characters or fewer) that add only noise.
        return [doc for doc in docs if len(doc.page_content) > 50]

    def chunk(self, docs: list) -> list:
        chunks = self.splitter.split_documents(docs)
        ingested_at = datetime.now(timezone.utc).isoformat()
        for i, chunk in enumerate(chunks):
            chunk.metadata["chunk_index"] = i
            chunk.metadata["ingested_at"] = ingested_at
        return chunks

    async def ingest_file(self, file_path: str) -> int:
        logger.info(f"Ingesting {file_path}")
        docs = self.load_document(file_path)
        if not docs:
            return 0
        docs = self.preprocess(docs)
        chunks = self.chunk(docs)
        for chunk in chunks:
            chunk.metadata["source"] = file_path
            chunk.metadata["file_type"] = Path(file_path).suffix.lower()
        if chunks:
            await self.vectorstore.aadd_documents(chunks)
        logger.info(f"Ingested {len(chunks)} chunks from {file_path}")
        return len(chunks)

    async def ingest_directory(self, dir_path: str) -> dict:
        results = {"total_files": 0, "total_chunks": 0, "errors": []}
        for file_path in Path(dir_path).rglob("*"):
            if file_path.suffix.lower() in LOADER_MAP:
                try:
                    count = await self.ingest_file(str(file_path))
                    results["total_files"] += 1
                    results["total_chunks"] += count
                except Exception as e:
                    # Record the failure and keep going so one bad file does not stop the run.
                    logger.error(f"Failed to ingest {file_path}: {e}")
                    results["errors"].append({"file": str(file_path), "error": str(e)})
        return results


async def main():
    pipeline = ProductionIngestionPipeline(
        qdrant_url="http://localhost:6333",
        collection_name="knowledge_base",
    )
    results = await pipeline.ingest_directory("./documents")
    print(f"Ingested {results['total_files']} files, "
          f"{results['total_chunks']} chunks, "
          f"{len(results['errors'])} errors")


if __name__ == "__main__":
    asyncio.run(main())
Managed vs Self-Hosted RAG Platforms
| Dimension | Managed (e.g., Pinecone + OpenAI) | Self-Hosted (e.g., Qdrant + BGE) |
|---|---|---|
| Setup time | Hours | Days to weeks |
| Operational overhead | Minimal (vendor manages infra) | Significant (you manage everything) |
| Cost at low scale | Low (pay per use) | Higher (minimum infrastructure) |
| Cost at high scale | Can become expensive | More predictable, often lower |
| Data privacy | Data sent to external APIs | Data stays in your infrastructure |
| Customization | Limited to vendor capabilities | Full control over every component |
| Latency | Depends on vendor SLAs | Optimizable for your workload |
| Vendor lock-in | Moderate to high | None |
| GPU management | Not your problem | Your problem (for self-hosted embeddings) |
For most teams starting out, the recommendation is: begin managed, migrate components to self-hosted as specific needs arise. Data privacy requirements (healthcare, finance, government) often force self-hosted embedding models first. Cost pressure at scale typically pushes vector database self-hosting second. LLM self-hosting is the last migration for most teams, as open models are still catching up on instruction-following quality for complex RAG prompts.
Putting It All Together
Building a production RAG system is a systems engineering problem, not an AI problem. The LLM and embedding model are commodities — the competitive advantage is in your data pipeline, evaluation framework, and operational infrastructure.
Start with the simplest architecture that could work:
- Recursive character text splitting with 1000-character chunks
- OpenAI text-embedding-3-small for embeddings
- A managed vector database (Pinecone or Weaviate Cloud)
- GPT-4o-mini for generation
- RAGAS for evaluation
Then iterate based on evidence from your evaluation metrics. Don't add hybrid search until you've measured that pure vector search is insufficient. Don't add reranking until you've measured that retrieval precision is the bottleneck. Don't add agentic patterns until you've confirmed that single-pass retrieval can't handle your query complexity — and when you do, see the agentic RAG guide for implementation patterns.
Every architectural decision should be backed by a measurable improvement on your evaluation dataset. If you can't measure it, you can't justify the complexity.
For foundational RAG concepts, revisit the RAG fundamentals guide. For a deep dive into retrieval and generation architectures, see the RAG architecture deep dive. And for deciding whether RAG is even the right approach for your use case, check RAG vs Fine-Tuning.
References
- Lewis, P., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020.
- Es, S., et al. (2023). "RAGAS: Automated Evaluation of Retrieval Augmented Generation." arXiv:2309.15217.
- Gao, Y., et al. (2024). "Retrieval-Augmented Generation for Large Language Models: A Survey." arXiv:2312.10997.
- Anthropic. (2024). "Contextual Retrieval." https://www.anthropic.com/news/contextual-retrieval
- LangChain Documentation. "RAG." https://python.langchain.com/docs/tutorials/rag/
- LlamaIndex Documentation. "Building a RAG Pipeline." https://docs.llamaindex.ai/
- Pinecone. "RAG Guide." https://www.pinecone.io/learn/retrieval-augmented-generation/
- Wang, L., et al. (2023). "Query2doc: Query Expansion with Large Language Models." arXiv:2303.07678.
FAQ
What's the minimum viable tech stack for a production RAG system?
An embedding model, a vector database, and an LLM — orchestrated by a framework like LangChain or LlamaIndex (or custom code). For the fastest path to production: OpenAI embeddings + Pinecone + GPT-4o-mini. This stack requires no GPU infrastructure, scales with managed services, and costs roughly $0.01–0.03 per query. Add complexity only when evaluation metrics justify it.
How often should I re-index my knowledge base?
It depends on how frequently your source documents change. For static documentation, a weekly full re-index is sufficient. For dynamic content (support tickets, wiki pages), implement incremental indexing triggered by change events. The key is tracking document versions — only re-embed chunks whose content has actually changed (use content hashing to detect changes).
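Change detection by content hash can be sketched in a few lines. The example assumes you persist a mapping of chunk ID to hash from the previous indexing run; `plan_reindex` and the `doc#n` ID scheme are illustrative choices, not a prescribed format.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(previous_hashes: dict[str, str], current_chunks: dict[str, str]):
    """Compare stored hashes with current text; return what to embed and what to delete."""
    current_hashes = {cid: content_hash(text) for cid, text in current_chunks.items()}
    # New or modified chunks need (re-)embedding.
    to_embed = sorted(cid for cid, h in current_hashes.items()
                      if previous_hashes.get(cid) != h)
    # Chunks that disappeared from the source should be removed from the index.
    to_delete = sorted(cid for cid in previous_hashes if cid not in current_hashes)
    return to_embed, to_delete, current_hashes


previous = {
    "doc1#0": content_hash("Install with pip."),
    "doc1#1": content_hash("Run the server."),
    "doc2#0": content_hash("Old FAQ entry."),
}
current = {
    "doc1#0": "Install with pip.",
    "doc1#1": "Run the server with --reload.",
    "doc3#0": "New tutorial.",
}
to_embed, to_delete, new_hashes = plan_reindex(previous, current)
```

Only the edited chunk and the new chunk are sent for embedding, the removed chunk is deleted, and `new_hashes` becomes the stored state for the next run.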
How do I handle documents that are too large for a single chunk?
Use the parent-child document strategy: split large documents into small chunks for retrieval but store the relationship to parent sections. When a small chunk matches, return the full parent section (or surrounding chunks) for generation context. This gives you precise retrieval with rich generation context.
What's the most common reason production RAG systems fail?
Poor chunking. Teams invest in expensive embedding models and sophisticated retrieval algorithms while splitting documents with naive fixed-size chunking that breaks mid-thought. A well-chunked knowledge base with a basic embedding model outperforms a poorly-chunked one with the best model every time. Invest in domain-appropriate chunking before anything else.
How do I evaluate RAG quality without labeled ground truth data?
Use LLM-as-judge metrics from frameworks like RAGAS. Faithfulness (is the answer grounded in retrieved context?) and answer relevance (does the answer address the question?) can be computed without ground truth. For context precision and recall, you'll eventually need reference answers — start by manually labeling 50–100 representative queries. This small labeled set gives you a reliable benchmark while you build out a larger evaluation dataset.
Should I use one large vector collection or split into multiple?
Start with a single collection and use metadata filtering to segment content. Split into separate collections only when you have distinct content domains with different embedding models, different access control requirements, or when a single collection exceeds your vector database's performance limits. Multiple collections add operational complexity (query routing, index management, consistency) that isn't justified until you hit concrete scaling or isolation requirements.