Building a Production-Ready RAG System: From Prototype to Deployment

Most RAG tutorials end where the real work begins. They show you how to load a PDF, split it into chunks, embed it, and query it — and that's fine for a demo. But the distance between that demo and a system that handles thousands of users, diverse document formats, stale data, adversarial queries, and evolving requirements is enormous.

This guide covers what it actually takes to build a production RAG system: the tech stack decisions, ingestion pipelines, chunking strategies, evaluation frameworks, and deployment architecture you need to get right before going live. If you're still getting oriented with RAG concepts, start with the complete guide to RAG. For real-world examples of these patterns in action, see the RAG case studies.

Tech Stack Overview

A production RAG system has five core layers. Your choices at each layer shape the system's capabilities, cost profile, and operational complexity.

| Layer | Options | Considerations |
| --- | --- | --- |
| Orchestration | LangChain, LlamaIndex, Haystack, custom | LangChain for flexibility, LlamaIndex for document-heavy workloads, custom for maximum control |
| Embedding models | OpenAI text-embedding-3-small/large, Cohere Embed v3, open-source (BGE, E5) | Cost vs. quality vs. data privacy. Open-source models avoid sending data to external APIs |
| Vector database | Pinecone, Weaviate, Qdrant, pgvector, Milvus, ChromaDB | Managed vs. self-hosted, scale requirements, metadata filtering capabilities |
| LLM | GPT-4o, Claude 3.5 Sonnet, Llama 3.1, Mixtral | Latency, cost, context window size, instruction following quality |
| Infrastructure | AWS, GCP, Azure, hybrid | Existing cloud commitments, data residency requirements, GPU availability |

Don't over-optimize tech stack decisions upfront. Start with managed services (OpenAI embeddings, Pinecone, GPT-4o) to validate the product. Migrate to self-hosted components when cost or data privacy demands it.

Data Ingestion Pipeline

The ingestion pipeline transforms raw documents into indexed, searchable chunks. This is where most production RAG complexity lives.

Document Loaders

Real-world knowledge bases span dozens of formats. Your ingestion pipeline needs to handle all of them reliably.

from langchain_community.document_loaders import (
    PyPDFLoader,
    UnstructuredHTMLLoader,
    CSVLoader,
    TextLoader,
    UnstructuredMarkdownLoader,
)
from datetime import datetime
from pathlib import Path

LOADER_MAP = {
    ".pdf": PyPDFLoader,
    ".html": UnstructuredHTMLLoader,
    ".htm": UnstructuredHTMLLoader,
    ".csv": CSVLoader,
    ".txt": TextLoader,
    ".md": UnstructuredMarkdownLoader,
}


def load_document(file_path: str) -> list:
    ext = Path(file_path).suffix.lower()
    loader_cls = LOADER_MAP.get(ext)
    if not loader_cls:
        raise ValueError(f"Unsupported file type: {ext}")

    loader = loader_cls(file_path)
    docs = loader.load()

    for doc in docs:
        doc.metadata.update({
            "source": file_path,
            "file_type": ext,
            "ingested_at": datetime.utcnow().isoformat(),
        })

    return docs

Text Extraction and Preprocessing

Raw document loading is just the first step. Production pipelines need preprocessing to clean extracted text:

  • Whitespace normalization: PDFs often produce inconsistent spacing, repeated newlines, and Unicode artifacts.
  • Header/footer removal: recurring page headers and footers in PDFs add noise to every chunk.
  • Table extraction: embedded tables need structure-preserving extraction, not linearized text.
  • Encoding handling: legacy documents in non-UTF-8 encodings require detection and conversion.

import re
import unicodedata


def preprocess_text(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"Page \d+ of \d+", "", text)
    text = text.strip()
    return text
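The regex above only catches one literal "Page X of Y" pattern. As a rough sketch of the header/footer removal idea, the snippet below drops any line that recurs on most pages. It assumes you have the text of each page separately; the function name `strip_repeated_lines` and the `min_fraction` threshold are hypothetical choices for illustration, not part of any library.

```python
from collections import Counter


def strip_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    """Drop lines that recur on at least `min_fraction` of pages (likely headers/footers)."""
    # Count each distinct line once per page, so a line repeated inside a
    # single page is not mistaken for a running header.
    counts = Counter()
    for page in pages:
        counts.update({line.strip() for line in page.splitlines() if line.strip()})

    # Require at least two occurrences so single-page documents are left alone.
    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [ln for ln in page.splitlines() if ln.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "ACME Handbook\nVacation policy details.\nConfidential",
    "ACME Handbook\nExpense policy details.\nConfidential",
    "ACME Handbook\nSecurity policy details.\nConfidential",
]
cleaned_pages = strip_repeated_lines(pages)
```

A frequency threshold like this can also remove legitimate repeated lines (for example a recurring table heading), so it is worth spot-checking the output on a sample of your own documents.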

Metadata Enrichment

Metadata attached to chunks at ingestion time powers filtering, access control, and retrieval quality at query time. Invest in rich metadata upfront.

Essential metadata fields:

  • Source: file path, URL, or database reference
  • Document type: policy, API doc, tutorial, FAQ
  • Created/updated timestamps: for freshness-weighted retrieval
  • Author/owner: for access control and attribution
  • Section hierarchy: chapter → section → subsection for context
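To make the fields above concrete, here is a minimal sketch of a metadata record and one way timestamps could feed freshness-weighted retrieval. The `ChunkMeta` class, the `freshness_weight` function, and the 180-day half-life are all assumptions for illustration; the right decay curve depends on how quickly your content goes stale.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ChunkMeta:
    source: str               # file path, URL, or database reference
    doc_type: str             # e.g. "policy", "api_doc", "tutorial", "faq"
    updated_at: datetime      # used for freshness weighting
    owner: str                # used for access control and attribution
    section_path: list[str]   # chapter -> section -> subsection


def freshness_weight(meta: ChunkMeta, now: datetime, half_life_days: float = 180.0) -> float:
    """Exponential decay: a chunk loses half its weight every `half_life_days`."""
    age_days = max((now - meta.updated_at).total_seconds() / 86400, 0.0)
    return 0.5 ** (age_days / half_life_days)


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
meta = ChunkMeta(
    source="docs/policies/vacation.md",
    doc_type="policy",
    updated_at=now - timedelta(days=180),
    owner="hr-team",
    section_path=["Policies", "Time Off", "Vacation"],
)
weight = freshness_weight(meta, now)  # one half-life old
```

At query time you could multiply the similarity score by this weight, or use it only as a tie-breaker if recency matters less than relevance.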

Chunking Strategies

Chunking is the single highest-leverage decision in a RAG pipeline. Bad chunking produces irrelevant retrievals regardless of how good your embedding model or vector database is.

Fixed-Size Chunking

The simplest approach: split text into chunks of N characters (or tokens) with an overlap window.

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separator="\n\n",
)

chunks = splitter.split_documents(documents)

Pros: simple, predictable chunk sizes, easy to reason about token budgets. Cons: splits mid-sentence, mid-paragraph, or mid-thought. A chunk boundary in the middle of a code example or multi-step explanation degrades retrieval quality.
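To see what the overlap window does, here is a dependency-free sketch of a plain sliding character window. It is a simplification for illustration, not LangChain's implementation (which splits on the separator first and then merges pieces); `fixed_size_chunks` is a hypothetical helper name.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Slide a window of `chunk_size` characters, stepping by chunk_size - overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, so no tail-only fragment is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks


example = fixed_size_chunks("abcdefghij", chunk_size=4, overlap=2)
# Each chunk repeats the last 2 characters of the previous one.
```

The repeated characters are the price of overlap: with a 200-character overlap on 1000-character chunks, roughly a quarter more text is embedded and stored than without overlap.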

Recursive Character Text Splitting

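Splits hierarchically: first by double newline (paragraphs), then single newline, then sentence, then word, then character. Each finer separator is tried only when a piece is still larger than the chunk size, so semantic boundaries are preserved better than with fixed-size splitting.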

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)

chunks = splitter.split_documents(documents)

This is the default recommendation for most use cases. It balances semantic preservation with predictable sizing.

Semantic Chunking

Groups sentences by semantic similarity rather than character count. Adjacent sentences with high embedding similarity stay together; a drop in similarity triggers a chunk boundary.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=75,
)

chunks = splitter.split_documents(documents)

Pros: chunks align with topic shifts, producing more coherent retrieval units. Cons: requires embedding each sentence during ingestion (costly at scale), chunk sizes are unpredictable, and the quality depends on the embedding model's ability to capture semantic boundaries.
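The breakpoint idea can be sketched without any library: compute the distance between each pair of adjacent sentence embeddings, then start a new chunk wherever the distance exceeds a percentile of all distances. This is an illustration of the concept under the assumption that sentence vectors are already computed; it is not the SemanticChunker implementation, and the helper names and toy 2-D vectors are made up.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_groups(sentences: list[str], vectors: list[list[float]], pct: float = 75.0) -> list[list[str]]:
    """Group adjacent sentences; break where adjacent distance exceeds the percentile."""
    if len(sentences) < 2:
        return [sentences] if sentences else []

    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)

    groups = [[sentences[0]]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:
            groups.append([sentence])   # similarity dropped: start a new chunk
        else:
            groups[-1].append(sentence)
    return groups


# Toy vectors: the first two sentences point one way, the last three another.
sentences = ["A", "B", "C", "D", "E"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.05, 0.95]]
groups = semantic_groups(sentences, vectors)
```

Because the threshold is relative to the document's own distance distribution, a higher percentile yields fewer, larger chunks and a lower one yields more, smaller chunks.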

Parent-Child Document Chunking

Indexes small chunks for precise retrieval but returns the parent document (or a larger surrounding window) for generation context. This solves the fundamental tension between retrieval precision (small chunks match better) and generation quality (larger context produces better answers).

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)

store = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

retriever.add_documents(documents)

For production use, replace InMemoryStore with a persistent store (Redis, PostgreSQL) so parent documents survive restarts.
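Underneath, the pattern is just a lookup from child chunk IDs to parent IDs. The sketch below shows that bookkeeping with plain dictionaries, assuming the vector search has already returned ranked child IDs; the function names and the `parent#n` ID scheme are hypothetical.

```python
def build_parent_child_index(parents: dict[str, str], child_size: int) -> tuple[dict[str, str], dict[str, str]]:
    """Return (child_id -> child_text, child_id -> parent_id)."""
    children: dict[str, str] = {}
    parent_of: dict[str, str] = {}
    for parent_id, text in parents.items():
        for n, start in enumerate(range(0, len(text), child_size)):
            child_id = f"{parent_id}#{n}"
            children[child_id] = text[start:start + child_size]
            parent_of[child_id] = parent_id
    return children, parent_of


def expand_to_parents(hit_child_ids: list[str], parent_of: dict[str, str], parents: dict[str, str]) -> list[str]:
    """Map ranked child hits to de-duplicated parent texts, keeping rank order."""
    seen: set[str] = set()
    expanded = []
    for child_id in hit_child_ids:
        parent_id = parent_of[child_id]
        if parent_id not in seen:
            seen.add(parent_id)
            expanded.append(parents[parent_id])
    return expanded


parents = {"doc1": "a" * 10, "doc2": "b" * 5}
children, parent_of = build_parent_child_index(parents, child_size=4)
# Pretend vector search returned these child IDs, best match first.
context = expand_to_parents(["doc1#2", "doc1#0", "doc2#1"], parent_of, parents)
```

De-duplicating parents matters for token budgets: several child hits from the same section should add that section to the prompt only once.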

Embedding Pipeline

Model Selection

Your embedding model determines the quality ceiling of your retrieval. The right choice depends on your domain, languages, and cost constraints.

| Model | Dimensions | Strengths | Considerations |
| --- | --- | --- | --- |
| text-embedding-3-small (OpenAI) | 1536 | Good general-purpose, low cost | Requires API calls, data leaves your infra |
| text-embedding-3-large (OpenAI) | 3072 | Higher quality, supports dimension reduction | Several times the per-token price of small (about 6.5x at OpenAI's launch pricing), same data privacy concern |
| Cohere Embed v3 | 1024 | Strong multilingual, input type parameter | API dependency |
| BGE-large-en-v1.5 | 1024 | Open-source, self-hostable | Requires GPU for reasonable throughput |
| E5-mistral-7b | 4096 | Highest quality open-source | Large model, significant GPU requirements |

Batch Processing

Embedding thousands of documents one at a time is slow and wasteful. Batch your embedding calls.

from langchain_openai import OpenAIEmbeddings
import asyncio

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    chunk_size=1000,  # batch size for API calls
)


async def embed_documents_batched(texts: list[str], batch_size: int = 500) -> list:
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        batch_embeddings = await embeddings.aembed_documents(batch)
        all_embeddings.extend(batch_embeddings)
    return all_embeddings

Caching Embeddings

Recomputing embeddings for unchanged documents is pure waste. Implement content-based caching:

import hashlib
import json

from langchain_openai import OpenAIEmbeddings


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()


class EmbeddingCache:
    def __init__(self, cache_store):
        self.cache = cache_store
        self.model = OpenAIEmbeddings(model="text-embedding-3-small")

    async def get_or_compute(self, texts: list[str]) -> list:
        results = [None] * len(texts)
        to_compute = []

        for i, text in enumerate(texts):
            h = content_hash(text)
            cached = self.cache.get(h)
            if cached is not None:
                results[i] = json.loads(cached)
            else:
                to_compute.append((i, text, h))

        if to_compute:
            new_texts = [t for _, t, _ in to_compute]
            new_embeddings = await self.model.aembed_documents(new_texts)
            for (i, _, h), emb in zip(to_compute, new_embeddings):
                results[i] = emb
                self.cache.set(h, json.dumps(emb))

        return results

Retrieval Optimization

Hybrid Search (Dense + Sparse)

Pure vector search misses exact keyword matches. Pure BM25 misses semantic similarity. Combine them.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(documents, k=10)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.3, 0.7],
)

results = hybrid_retriever.invoke("How to configure OAuth2 in the API gateway")

The weight split (30% BM25, 70% dense) is a starting point. Tune it on your evaluation dataset. Technical documentation and code-heavy domains often benefit from higher BM25 weights (40–50%).
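To build intuition for how the weights act, here is a small sketch of weighted reciprocal rank fusion, one common way to merge ranked lists from different retrievers. Whether your retriever uses exactly this formula is an assumption to check against its documentation; the function name and the constant `c=60` (a conventional default) are illustrative.

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists: each list contributes weight / (c + rank) per document."""
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_ranking = ["a", "b", "c"]    # keyword matches, best first
dense_ranking = ["b", "d", "a"]   # vector matches, best first
fused = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.3, 0.7])
```

Rank-based fusion sidesteps the problem that BM25 scores and cosine similarities live on different scales: only positions in each list matter, and the weights decide how much each list's positions count.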

Query Transformation

User queries are often vague, incomplete, or poorly phrased for retrieval. Transform them before searching.

HyDE (Hypothetical Document Embeddings): generate a hypothetical answer, then embed that instead of the query. The hypothetical answer is closer in embedding space to real documents than the short query.

from langchain.chains import HypotheticalDocumentEmbedder
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    base_embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
    prompt_key="web_search",
)

Multi-query retrieval: generate multiple query variations and merge results. This improves recall for ambiguous queries.
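The merge step of multi-query retrieval can be sketched independently of how the variations are generated. The snippet below assumes the query variants already exist (in practice an LLM would produce them) and that `search` is any callable returning ranked document IDs; both the helper name and the toy index are hypothetical.

```python
from typing import Callable


def multi_query_retrieve(
    variants: list[str],
    search: Callable[[str], list[str]],
    limit: int = 10,
) -> list[str]:
    """Run every query variant and return the de-duplicated union in first-seen order."""
    merged: list[str] = []
    seen: set[str] = set()
    for query in variants:
        for doc_id in search(query):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged[:limit]


# Toy stand-in for a retriever: maps a query string to ranked document IDs.
toy_index = {
    "configure oauth2": ["gateway-auth", "oauth-overview"],
    "api gateway authentication setup": ["gateway-auth", "gateway-install"],
    "oauth2 client credentials": ["oauth-overview", "token-lifetimes"],
}
merged_ids = multi_query_retrieve(list(toy_index), lambda q: toy_index.get(q, []))
```

Each extra variant costs another retrieval call, so a handful of variants is usually a sensible starting point before measuring whether recall actually improves.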

Contextual Compression

Retrieved chunks often contain irrelevant surrounding text. A contextual compressor extracts only the parts relevant to the query, improving the signal-to-noise ratio in the generation prompt.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

compressor = LLMChainExtractor.from_llm(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0)
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=hybrid_retriever,
)

This adds an LLM call per retrieval, so use it selectively — on the top 5 results after reranking, not on 50 raw retrieval results.

Evaluation Framework

You can't improve what you can't measure. Production RAG systems need continuous evaluation across multiple dimensions.

Core Metrics

| Metric | What It Measures | How to Compute |
| --- | --- | --- |
| Faithfulness | Is the response grounded in retrieved context? | LLM-as-judge: check each claim against source chunks |
| Answer relevance | Does the response actually answer the question? | LLM-as-judge: score answer against original query |
| Context precision | Are the retrieved chunks relevant to the query? | Ratio of relevant chunks in top-k results |
| Context recall | Did retrieval find all relevant information? | Compare retrieved info against ground truth answers |
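The two retrieval metrics are simple to compute once you have labeled which chunks are relevant for a query. The sketch below is the plain set-based version described in the table; framework implementations such as RAGAS use LLM judgments and rank weighting instead, so treat these hypothetical helpers as a baseline rather than a drop-in replacement.

```python
def context_precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are labeled relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for chunk_id in top_k if chunk_id in relevant_ids) / len(top_k)


def context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of labeled-relevant chunks that retrieval actually returned."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids) & relevant_ids) / len(relevant_ids)


retrieved = ["c1", "c7", "c3", "c9"]   # ranked retrieval output
relevant = {"c1", "c3", "c4"}          # human-labeled relevant chunks

precision = context_precision_at_k(retrieved, relevant, k=4)
recall = context_recall(retrieved, relevant)
```

Here two of the four retrieved chunks are relevant and two of the three relevant chunks were found, which is the kind of gap (missing `c4`) that points at retrieval rather than generation.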

RAGAS Framework

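RAGAS provides automated evaluation for RAG pipelines. Its LLM-judged metrics, such as faithfulness and answer relevancy, can be scored without human-labeled ground truth, while context recall compares retrieval against reference answers, which is why the dataset below includes a ground_truth column.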

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

eval_dataset = Dataset.from_dict({
    "question": questions,
    "answer": generated_answers,
    "contexts": retrieved_contexts,
    "ground_truth": reference_answers,
})

results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

print(results)
# {'faithfulness': 0.87, 'answer_relevancy': 0.91,
#  'context_precision': 0.78, 'context_recall': 0.82}

Evaluation Script

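Here's a skeleton evaluation pipeline that you can run against your RAG system. It scores faithfulness with an LLM judge and records latency; the relevance and retrieval-precision fields are left as placeholders to fill in. It assumes rag_chain.invoke returns a dict with "answer" and "contexts" keys: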

import json
from datetime import datetime
from dataclasses import dataclass, asdict
from langchain_openai import ChatOpenAI


@dataclass
class EvalResult:
    question: str
    generated_answer: str
    retrieved_chunks: list[str]
    faithfulness_score: float
    relevance_score: float
    retrieval_precision: float
    latency_ms: float


def evaluate_faithfulness(answer: str, contexts: list[str], llm) -> float:
    context_text = "\n---\n".join(contexts)
    prompt = f"""Given the following context and answer, score how well the answer
is supported by the context. Return a score from 0.0 to 1.0.

Context:
{context_text}

Answer:
{answer}

Score (0.0-1.0):"""

    response = llm.invoke(prompt)
    try:
        return float(response.content.strip())
    except ValueError:
        return 0.0


def run_evaluation(rag_chain, eval_questions: list[dict], output_path: str):
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    results = []

    for item in eval_questions:
        start = datetime.now()
        response = rag_chain.invoke(item["question"])
        latency = (datetime.now() - start).total_seconds() * 1000

        faith_score = evaluate_faithfulness(
            response["answer"], response["contexts"], llm
        )

        results.append(EvalResult(
            question=item["question"],
            generated_answer=response["answer"],
            retrieved_chunks=response["contexts"],
            faithfulness_score=faith_score,
            relevance_score=0.0,  # compute similarly
            retrieval_precision=0.0,  # requires ground truth
            latency_ms=latency,
        ))

    with open(output_path, "w") as f:
        json.dump([asdict(r) for r in results], f, indent=2)

    avg_faith = sum(r.faithfulness_score for r in results) / len(results)
    avg_latency = sum(r.latency_ms for r in results) / len(results)
    print(f"Average faithfulness: {avg_faith:.2f}")
    print(f"Average latency: {avg_latency:.0f}ms")
    print(f"Results written to {output_path}")

Human Evaluation

Automated metrics get you 80% of the way. The remaining 20% requires human review, especially for subjective quality dimensions like "helpfulness" and "completeness." Build a lightweight human evaluation workflow:

  1. Sample 50–100 queries per week from production traffic.
  2. Have domain experts rate responses on a 1–5 scale for relevance, accuracy, and completeness.
  3. Track scores over time to detect regressions.
  4. Feed low-scoring examples back into your evaluation dataset.

Deployment Architecture

Infrastructure

A production RAG system has multiple services that need to scale independently.

                    ┌─────────────┐
                    │   Load      │
                    │   Balancer  │
                    └──────┬──────┘
                           │
                    ┌──────┴──────┐
                    │   API       │
                    │   Gateway   │
                    └──────┬──────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
        ┌─────┴─────┐ ┌───┴───┐ ┌─────┴─────┐
        │  Query     │ │ Embed │ │ Ingestion │
        │  Service   │ │ Svc   │ │ Workers   │
        └─────┬─────┘ └───┬───┘ └─────┬─────┘
              │            │            │
        ┌─────┴────────────┴────────────┴─────┐
        │           Vector Database            │
        │     (Pinecone / Qdrant / Weaviate)   │
        └──────────────────────────────────────┘

Query Service handles incoming queries, orchestrates retrieval, and calls the LLM. Scale horizontally based on QPS.

Embedding Service wraps the embedding model. If using an API (OpenAI), this is a thin proxy with rate limiting and retry logic. If self-hosting, this runs the model on GPU instances.

Ingestion Workers process new documents asynchronously. These are triggered by file uploads, webhook events, or scheduled crawls. Scale based on ingestion backlog.

Monitoring and Observability

You need visibility into every stage of the pipeline:

  • Retrieval quality: log queries, retrieved chunk IDs, and relevance scores. Dashboards showing retrieval precision over time catch silent degradation.
  • Generation quality: log prompts, responses, and automated faithfulness scores. Alert on faithfulness drops below your threshold.
  • Latency breakdown: instrument each stage (embedding, retrieval, reranking, generation) independently. A P95 latency spike in reranking is invisible if you only measure end-to-end latency.
  • Cost tracking: log token usage per query. A sudden cost increase might indicate a prompt engineering regression or an agent loop issue (see Agentic RAG for loop risks).
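Per-stage latency instrumentation can be as light as a context manager around each step. The sketch below uses only the standard library; `StageTimer` is a hypothetical helper, and the `time.sleep` calls stand in for real retrieval and generation calls.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage, in milliseconds."""

    def __init__(self) -> None:
        self.timings_ms: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the stage raised, so failures are visible too.
            elapsed = (time.perf_counter() - start) * 1000
            self.timings_ms[name] = self.timings_ms.get(name, 0.0) + elapsed


timer = StageTimer()
with timer.stage("retrieval"):
    time.sleep(0.01)   # stand-in for a vector search call
with timer.stage("generation"):
    time.sleep(0.02)   # stand-in for an LLM call

total_ms = sum(timer.timings_ms.values())
```

Emitting `timer.timings_ms` as structured log fields per request gives you the per-stage percentiles directly in whatever metrics backend you already use.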

Cost Optimization

RAG costs come from three sources: embedding computation, vector database operations, and LLM inference.

| Cost Driver | Optimization |
| --- | --- |
| Embedding API calls | Cache embeddings by content hash; batch operations; use smaller models where quality is sufficient |
| Vector DB storage | Reduce dimensions (Matryoshka embeddings); quantize vectors; prune stale documents |
| LLM inference | Use smaller models for simple queries (GPT-4o-mini); cache frequent query-response pairs; truncate context to relevant chunks only |
| Reranking | Only rerank top-N candidates, not full result set; use lighter rerankers for simple queries |

A typical production RAG query costs $0.01–0.05 with managed services. At 100K queries/month, that's $1,000–5,000/month — significant enough to warrant optimization but reasonable for most B2B use cases.
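The arithmetic behind that range is worth keeping in a small function so you can plug in your own traffic numbers. The sketch below reproduces the figures above and adds an optional response-cache hit rate; the function name and the 30% hit rate in the last line are hypothetical, not measurements.

```python
def monthly_query_cost(
    queries_per_month: int,
    cost_per_query_usd: float,
    cache_hit_rate: float = 0.0,
) -> float:
    """Estimated monthly spend, assuming cached responses cost nothing to serve."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    billable_queries = queries_per_month * (1.0 - cache_hit_rate)
    return billable_queries * cost_per_query_usd


for label, per_query in [("low", 0.01), ("high", 0.05)]:
    print(f"{label}: ${monthly_query_cost(100_000, per_query):,.0f}/month")

# Hypothetical scenario: 30% of queries answered from a response cache.
with_cache = monthly_query_cost(100_000, 0.05, cache_hit_rate=0.3)
```

Treating cached responses as free is a simplification, since the cache itself has storage and lookup costs, but those are usually small next to LLM inference.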

Scaling Strategies

Read scaling: vector databases handle read scaling well. Pinecone and Weaviate scale horizontally with replicas. Qdrant supports distributed deployments.

Write scaling: ingestion is typically the bottleneck. Use message queues (SQS, RabbitMQ) to buffer ingestion requests and process them with auto-scaling workers.
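The queue-and-workers shape can be sketched in-process with asyncio. This is only an illustration of the pattern: a real deployment would use an external queue such as SQS or RabbitMQ, and the `asyncio.sleep(0)` call is a stand-in for the actual ingestion work. All names here are hypothetical.

```python
import asyncio


async def ingestion_worker(queue: asyncio.Queue, processed: list[str]) -> None:
    """Pull file paths off the queue forever; cancelled by the coordinator when done."""
    while True:
        path = await queue.get()
        try:
            await asyncio.sleep(0)   # stand-in for loading, chunking, and embedding
            processed.append(path)
        finally:
            queue.task_done()        # always acknowledge, even if processing failed


async def run_ingestion(paths: list[str], num_workers: int = 3) -> list[str]:
    # A bounded queue applies backpressure: producers wait when workers fall behind.
    queue: asyncio.Queue = asyncio.Queue(maxsize=100)
    processed: list[str] = []
    workers = [
        asyncio.create_task(ingestion_worker(queue, processed))
        for _ in range(num_workers)
    ]

    for path in paths:
        await queue.put(path)
    await queue.join()               # wait until every queued item is acknowledged

    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return processed


done = asyncio.run(run_ingestion(["a.pdf", "b.md", "c.txt", "d.html"]))
```

With an external queue, `num_workers` becomes the auto-scaling knob: scale it on queue depth rather than CPU, since ingestion workers often spend most of their time waiting on embedding API calls.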

LLM scaling: the LLM is usually the latency bottleneck. Strategies include request batching, serving self-hosted models with vLLM (which provides continuous batching and prefix caching), and routing simple queries to faster/cheaper models.

Complete Ingestion Pipeline

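Here's a production-oriented ingestion pipeline that ties together document loading, preprocessing, chunking, embedding, and indexing. It assumes a Qdrant instance is reachable at the given URL and that the target collection already exists with a vector size matching the embedding model (1536 for text-embedding-3-small):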

import asyncio
import logging
from pathlib import Path
from datetime import datetime
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant
from qdrant_client import QdrantClient

logger = logging.getLogger(__name__)

LOADER_MAP = {
    ".pdf": "PyPDFLoader",
    ".html": "UnstructuredHTMLLoader",
    ".md": "UnstructuredMarkdownLoader",
    ".txt": "TextLoader",
}


class ProductionIngestionPipeline:
    def __init__(self, qdrant_url: str, collection_name: str):
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n\n", "\n", ". ", " ", ""],
        )
        self.embeddings = OpenAIEmbeddings(
            model="text-embedding-3-small",
            chunk_size=500,
        )
        self.client = QdrantClient(url=qdrant_url)
        self.collection_name = collection_name
        self.vectorstore = Qdrant(
            client=self.client,
            collection_name=collection_name,
            embeddings=self.embeddings,
        )

    def load_document(self, file_path: str) -> list:
        ext = Path(file_path).suffix.lower()
        if ext not in LOADER_MAP:
            logger.warning(f"Skipping unsupported file type: {ext}")
            return []

        from langchain_community.document_loaders import (
            PyPDFLoader,
            UnstructuredHTMLLoader,
            UnstructuredMarkdownLoader,
            TextLoader,
        )

        loader_classes = {
            ".pdf": PyPDFLoader,
            ".html": UnstructuredHTMLLoader,
            ".md": UnstructuredMarkdownLoader,
            ".txt": TextLoader,
        }

        loader = loader_classes[ext](file_path)
        return loader.load()

    def preprocess(self, docs: list) -> list:
        import re
        import unicodedata

        for doc in docs:
            text = unicodedata.normalize("NFKC", doc.page_content)
            text = re.sub(r"\n{3,}", "\n\n", text)
            text = re.sub(r"[ \t]+", " ", text)
            doc.page_content = text.strip()

        return [doc for doc in docs if len(doc.page_content) > 50]

    def chunk(self, docs: list) -> list:
        chunks = self.splitter.split_documents(docs)
        for i, chunk in enumerate(chunks):
            chunk.metadata["chunk_index"] = i
            chunk.metadata["ingested_at"] = datetime.utcnow().isoformat()
        return chunks

    async def ingest_file(self, file_path: str) -> int:
        logger.info(f"Ingesting {file_path}")
        docs = self.load_document(file_path)
        if not docs:
            return 0

        docs = self.preprocess(docs)
        chunks = self.chunk(docs)

        for chunk in chunks:
            chunk.metadata["source"] = file_path
            chunk.metadata["file_type"] = Path(file_path).suffix.lower()

        await self.vectorstore.aadd_documents(chunks)
        logger.info(f"Ingested {len(chunks)} chunks from {file_path}")
        return len(chunks)

    async def ingest_directory(self, dir_path: str) -> dict:
        results = {"total_files": 0, "total_chunks": 0, "errors": []}
        for file_path in Path(dir_path).rglob("*"):
            if file_path.suffix.lower() in LOADER_MAP:
                try:
                    count = await self.ingest_file(str(file_path))
                    results["total_files"] += 1
                    results["total_chunks"] += count
                except Exception as e:
                    logger.error(f"Failed to ingest {file_path}: {e}")
                    results["errors"].append({"file": str(file_path), "error": str(e)})
        return results


async def main():
    pipeline = ProductionIngestionPipeline(
        qdrant_url="http://localhost:6333",
        collection_name="knowledge_base",
    )
    results = await pipeline.ingest_directory("./documents")
    print(f"Ingested {results['total_files']} files, "
          f"{results['total_chunks']} chunks, "
          f"{len(results['errors'])} errors")


if __name__ == "__main__":
    asyncio.run(main())

Managed vs Self-Hosted RAG Platforms

| Dimension | Managed (e.g., Pinecone + OpenAI) | Self-Hosted (e.g., Qdrant + BGE) |
| --- | --- | --- |
| Setup time | Hours | Days to weeks |
| Operational overhead | Minimal (vendor manages infra) | Significant (you manage everything) |
| Cost at low scale | Low (pay per use) | Higher (minimum infrastructure) |
| Cost at high scale | Can become expensive | More predictable, often lower |
| Data privacy | Data sent to external APIs | Data stays in your infrastructure |
| Customization | Limited to vendor capabilities | Full control over every component |
| Latency | Depends on vendor SLAs | Optimizable for your workload |
| Vendor lock-in | Moderate to high | None |
| GPU management | Not your problem | Your problem (for self-hosted embeddings) |

For most teams starting out, the recommendation is: begin managed, migrate components to self-hosted as specific needs arise. Data privacy requirements (healthcare, finance, government) often force self-hosted embedding models first. Cost pressure at scale typically pushes vector database self-hosting second. LLM self-hosting is the last migration for most teams, as open models are still catching up on instruction-following quality for complex RAG prompts.

Putting It All Together

Building a production RAG system is a systems engineering problem, not an AI problem. The LLM and embedding model are commodities — the competitive advantage is in your data pipeline, evaluation framework, and operational infrastructure.

Start with the simplest architecture that could work:

  1. Recursive character text splitting with 1000-character chunks and 200-character overlap, as in the examples above
  2. OpenAI text-embedding-3-small for embeddings
  3. A managed vector database (Pinecone or Weaviate Cloud)
  4. GPT-4o-mini for generation
  5. RAGAS for evaluation

Then iterate based on evidence from your evaluation metrics. Don't add hybrid search until you've measured that pure vector search is insufficient. Don't add reranking until you've measured that retrieval precision is the bottleneck. Don't add agentic patterns until you've confirmed that single-pass retrieval can't handle your query complexity — and when you do, see the agentic RAG guide for implementation patterns.

Every architectural decision should be backed by a measurable improvement on your evaluation dataset. If you can't measure it, you can't justify the complexity.

For foundational RAG concepts, revisit the RAG fundamentals guide. For a deep dive into retrieval and generation architectures, see the RAG architecture deep dive. And for deciding whether RAG is even the right approach for your use case, check RAG vs Fine-Tuning.

FAQ

What's the minimum viable tech stack for a production RAG system?

An embedding model, a vector database, and an LLM — orchestrated by a framework like LangChain or LlamaIndex (or custom code). For the fastest path to production: OpenAI embeddings + Pinecone + GPT-4o-mini. This stack requires no GPU infrastructure, scales with managed services, and costs roughly $0.01–0.03 per query. Add complexity only when evaluation metrics justify it.

How often should I re-index my knowledge base?

It depends on how frequently your source documents change. For static documentation, a weekly full re-index is sufficient. For dynamic content (support tickets, wiki pages), implement incremental indexing triggered by change events. The key is tracking document versions — only re-embed chunks whose content has actually changed (use content hashing to detect changes).
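As a rough sketch of hash-based change detection, the function below compares the hashes stored at the last indexing run with the current chunk contents and returns what needs re-embedding and what should be deleted. It assumes chunks have stable IDs across runs; `plan_reindex` and the dictionary shapes are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    indexed_hashes: dict[str, str],   # chunk_id -> hash stored at last index time
    current_chunks: dict[str, str],   # chunk_id -> current chunk text
) -> tuple[list[str], list[str]]:
    """Return (chunk ids to embed, chunk ids to delete)."""
    to_embed = [
        chunk_id
        for chunk_id, text in current_chunks.items()
        if indexed_hashes.get(chunk_id) != content_hash(text)   # new or changed
    ]
    to_delete = [chunk_id for chunk_id in indexed_hashes if chunk_id not in current_chunks]
    return to_embed, to_delete


indexed = {
    "a": content_hash("alpha"),
    "b": content_hash("beta"),
    "c": content_hash("gamma"),
}
current = {"a": "alpha", "b": "beta v2", "d": "delta"}
to_embed, to_delete = plan_reindex(indexed, current)
```

Unchanged chunk `a` is skipped entirely, which is where the savings come from. Note that an edit early in a document can shift chunk boundaries and change many downstream hashes, so boundary-stable chunking (for example splitting on headings) makes this far more effective.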

How do I handle documents that are too large for a single chunk?

Use the parent-child document strategy: split large documents into small chunks for retrieval but store the relationship to parent sections. When a small chunk matches, return the full parent section (or surrounding chunks) for generation context. This gives you precise retrieval with rich generation context.

What's the most common reason production RAG systems fail?

Poor chunking. Teams invest in expensive embedding models and sophisticated retrieval algorithms while splitting documents with naive fixed-size chunking that breaks mid-thought. A well-chunked knowledge base with a basic embedding model often outperforms a poorly chunked one with the best model. Invest in domain-appropriate chunking before anything else.

How do I evaluate RAG quality without labeled ground truth data?

Use LLM-as-judge metrics from frameworks like RAGAS. Faithfulness (is the answer grounded in retrieved context?) and answer relevance (does the answer address the question?) can be computed without ground truth. For context precision and recall, you'll eventually need reference answers — start by manually labeling 50–100 representative queries. This small labeled set gives you a reliable benchmark while you build out a larger evaluation dataset.

Should I use one large vector collection or split into multiple?

Start with a single collection and use metadata filtering to segment content. Split into separate collections only when you have distinct content domains with different embedding models, different access control requirements, or when a single collection exceeds your vector database's performance limits. Multiple collections add operational complexity (query routing, index management, consistency) that isn't justified until you hit concrete scaling or isolation requirements.
