RAG Architecture Deep Dive: Embeddings, Vector Databases, and Retrieval

Building a RAG prototype takes an afternoon. Building a RAG system that retrieves the right context reliably at scale takes serious architectural decisions. The embedding model, vector database, chunking strategy, and retrieval mechanism each have an outsized impact on the quality of your final output.

This post is a technical deep dive into the components that make or break a RAG system. If you're new to RAG, start with RAG Fundamentals for the basics and code examples, then come back here for the architecture decisions.

RAG Architecture Components

A production RAG system has six key components, each with its own design decisions:

┌─────────────────────────────────────────────────────────────────┐
│                    INDEXING PIPELINE (Offline)                   │
│                                                                 │
│  Documents → Loader → Chunker → Embedder → Vector Store         │
│                                    │                            │
│                         Embedding Model                         │
│                    (OpenAI, Cohere, SBERT)                      │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    QUERY PIPELINE (Online)                       │
│                                                                 │
│  Query → Embedder → Vector Search → Re-ranker → Prompt Builder  │
│                                                        │        │
│                                                     LLM Gen     │
│                                                        │        │
│                                                    Response      │
└─────────────────────────────────────────────────────────────────┘

Let's break down each component.

Embeddings Deep Dive

The embedding model is the most critical choice in your RAG stack. It determines how well your system understands the semantic relationship between queries and documents.

Commercial Embedding Models

OpenAI text-embedding-3-large: One of the strongest commercial options on public retrieval benchmarks. 3072 dimensions, with native support for reducing the dimension count. Strong multilingual performance. ~$0.13 per 1M tokens.

OpenAI text-embedding-3-small: A lighter option at 1536 dimensions. Good balance of quality and cost at ~$0.02 per 1M tokens. Sufficient for most production use cases.

Cohere embed-v3: Competitive with OpenAI, with native support for distinct input types (search_query vs. search_document), which improves retrieval quality. Available in English and multilingual variants.
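The dimension reduction noted for text-embedding-3-large can also be approximated on the client side: keep the leading dimensions of a stored vector and rescale it to unit length. The sketch below assumes the model was trained so that leading dimensions carry most of the signal; `shorten_embedding` and the four-value vector are illustrative stand-ins, not part of any SDK.

```python
import math

def shorten_embedding(vector, k):
    """Keep the first k dimensions and rescale the result to unit L2 norm."""
    head = list(vector[:k])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero prefix: nothing to normalize
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]         # toy stand-in for a 3072-dimension vector
short = shorten_embedding(full, 2)  # two dimensions, unit length
```

Re-normalizing matters because cosine and dot-product scores are only comparable across vectors of the same length and scale; every vector in a collection must be shortened to the same k.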

Open-Source Embedding Models

sentence-transformers/all-MiniLM-L6-v2: The workhorse of open-source embeddings. 384 dimensions, fast inference, good quality for English text. Runs on CPU.

BAAI/bge-large-en-v1.5: Higher quality than MiniLM at 1024 dimensions. Excellent MTEB benchmark scores. Requires a GPU for reasonable throughput.

nomic-ai/nomic-embed-text-v1.5: Strong contender with 768 dimensions. Supports an 8192-token context window, much larger than most alternatives, which makes it a good fit for longer chunks.

Embedding Model Selection Criteria

Use case	Recommended model
Prototyping	`all-MiniLM-L6-v2` (free, fast, good enough)
Production (cost-sensitive)	OpenAI `text-embedding-3-small`
Production (quality-first)	OpenAI `text-embedding-3-large` or Cohere `embed-v3`
Self-hosted / air-gapped	`bge-large-en-v1.5` or `nomic-embed-text-v1.5`
Multilingual	Cohere `embed-v3` multilingual or `intfloat/multilingual-e5-large`

Vector Database Comparison

Vector databases store your embeddings and provide fast similarity search. The landscape has matured significantly, with options ranging from in-process libraries to fully managed cloud services.

Feature	FAISS	Chroma	Pinecone	Weaviate	Qdrant
Type	Library	Embedded/Client-Server	Managed cloud	Self-hosted + Cloud	Self-hosted + Cloud
Setup	pip install	pip install	API key	Docker or Cloud	Docker or Cloud
Persistence	Manual (save/load)	Built-in	Managed	Built-in	Built-in
Metadata filtering	None	Yes	Yes	Yes (GraphQL)	Yes (rich filters)
Max vectors	Billions (RAM)	Millions	Billions	Billions	Billions
ANN algorithm	IVF, HNSW, PQ	HNSW	Proprietary	HNSW	HNSW
Hybrid search	No	No	Yes (sparse+dense)	Yes (BM25+vector)	Yes (sparse+dense)
Pricing	Free (OSS)	Free (OSS)	From $70/mo	Free (OSS) / Cloud	Free (OSS) / Cloud
Best for	Research, prototypes	Local dev, small apps	Managed production	Full-featured self-host	Performance-critical

When to Use Each

FAISS — Best for research and prototyping when you need maximum flexibility and don't need persistence or metadata filtering. Built by Meta, battle-tested at scale.

Chroma — The easiest way to get started. Runs in-process with Python, has a good API, and handles persistence. Ideal for local development and smaller production workloads.

Pinecone — Best fully managed option. Zero infrastructure management, strong performance, and serverless pricing. Choose this when you don't want to operate infrastructure.

Weaviate — Feature-rich with built-in vectorization, GraphQL API, and multi-tenancy. Strong choice for teams that want a self-hosted solution with enterprise features.

Qdrant — Excellent performance, rich filtering, and a clean API. Written in Rust with strong focus on speed. Growing quickly in the ecosystem.

Retrieval Mechanisms

How you search your vector store matters as much as what you store in it.

Dense Retrieval

The standard approach: embed the query, find the K nearest vectors by cosine similarity or dot product. Works well when your embedding model captures the semantic relationship between queries and documents.

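# Dense retrieval with Chroma
# Assumes `collection` is an existing Chroma collection and `query_embedding`
# was produced by the same model used to embed the stored documents.
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5  # number of nearest neighbors to return
)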

Strengths: Understands meaning, handles paraphrases well.

Weaknesses: Can miss exact keyword matches, struggles with rare terms and proper nouns.
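To make the nearest-neighbor step concrete, here is a brute-force sketch of top-K search by cosine similarity. The three-dimensional vectors and the `top_k` helper are hypothetical; a vector database replaces the linear scan with an approximate index such as HNSW.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return (doc_id, score) pairs for the k most similar documents."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for real embeddings
doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
hits = top_k([1.0, 0.1, 0.0], doc_vecs, k=2)
```

When vectors are already normalized to unit length, the dot product gives the same ranking as cosine similarity and is cheaper to compute.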

Sparse Retrieval

Traditional keyword-based search using algorithms like BM25 or TF-IDF. Represents documents as sparse vectors based on term frequency.

Strengths: Precise keyword matching, handles rare terms well, fast.

Weaknesses: No semantic understanding. To a keyword index, "car" and "automobile" are unrelated terms.
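A compact BM25 scorer shows both properties. This is a sketch under stated assumptions: whitespace tokenization, the common defaults k1=1.5 and b=0.75, and the non-negative IDF variant that adds 1 inside the logarithm. The function names and sample documents are illustrative.

```python
import math
from collections import Counter

def tokenize(text):
    """Naive lowercase whitespace tokenizer (a stand-in for a real analyzer)."""
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no exact match, no contribution
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "the car is parked outside",
    "an automobile needs regular service",
    "bm25 ranks documents by term overlap",
]
scores = bm25_scores("car service", docs)
synonym_scores = bm25_scores("automobile", docs)
```

The query "automobile" scores zero against the first document even though it is about a car, which is exactly the gap that dense retrieval fills.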

Hybrid Search

Combines dense and sparse retrieval to get the best of both worlds. The query runs through both a vector search and a keyword search, and results are merged using reciprocal rank fusion (RRF) or a learned combination.

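# Hybrid search with Weaviate (v3 Python client query-builder syntax)
# Assumes `client` is an already-connected Weaviate client and that a
# "Document" class with a "content" property exists in the schema.
results = (
    client.query.get("Document", ["content"])
    .with_hybrid(query="RAG architecture patterns", alpha=0.75)
    .with_limit(5)
    .do()
)
# alpha=1.0 → pure vector search
# alpha=0.0 → pure keyword search
# alpha=0.75 → weighted toward vector (recommended starting point)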

Hybrid search outperforms either approach alone in most benchmarks, especially on queries containing specific keywords, technical terms, or proper nouns. If your vector database supports it, use it.
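Reciprocal rank fusion, mentioned above as one way to merge the two result lists, needs only the rank positions, so it works even when dense and sparse scores are on different scales. A minimal sketch, assuming the commonly used constant k=60; the document IDs are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc IDs; each list adds 1 / (k + rank) per document."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)

dense_ranking = ["doc_a", "doc_b", "doc_c"]   # from vector search
sparse_ranking = ["doc_b", "doc_d", "doc_a"]  # from keyword search
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Documents that appear in both lists rise to the top, and a larger k flattens the difference between adjacent ranks. Note that a weighted parameter such as alpha blends the two sources differently from plain RRF, which treats every list equally.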

Re-Ranking Strategies

The initial retrieval step is optimized for speed (searching millions of vectors in milliseconds). Re-ranking adds a second, more accurate scoring pass on just the top-K results.

Cross-Encoder Re-Rankers

Cross-encoders process the query and each document together through a transformer, producing a relevance score. This is more accurate than bi-encoder similarity (used in initial retrieval) but much slower — which is why you re-rank only the top 20–50 results, not the entire corpus.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

query = "How does RAG reduce hallucinations?"
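# `retrieved_docs` is assumed to hold the LangChain Document objects
# returned by the initial retrieval step (typically the top 20–50)
documents = [doc.page_content for doc in retrieved_docs]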

# Score each query-document pair
pairs = [[query, doc] for doc in documents]
scores = reranker.predict(pairs)

# Re-sort by relevance score
ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
top_docs = [doc for doc, score in ranked[:5]]

Cohere Rerank

Cohere offers a managed re-ranking API that's easy to integrate:

import cohere

co = cohere.Client("your-api-key")

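# `retrieved_texts` is assumed to be a list of strings from the initial retrieval step
results = co.rerank(
    query="How does RAG reduce hallucinations?",
    documents=retrieved_texts,
    model="rerank-english-v3.0",
    top_n=5
)

# Each result carries the index of its document in the input list,
# so look the text up there instead of relying on the response echoing it back.
for result in results.results:
    text = retrieved_texts[result.index]
    print(f"Score: {result.relevance_score:.4f} | {text[:80]}...")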

Re-ranking is one of the highest-ROI improvements you can make to a RAG system. It typically improves retrieval precision by 10–30% with minimal latency cost (50–100ms for 20 documents).

Chunking Strategies

How you split documents into chunks directly impacts retrieval quality. There's no universal best strategy — it depends on your document structure and query patterns.

Fixed-Size Chunking

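Split text into chunks of a fixed token or character count, with overlap between neighboring chunks so that content near a boundary appears intact in at least one of them. The example uses LangChain's RecursiveCharacterTextSplitter as a size-capped splitter.

from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are measured in characters by default;
# use RecursiveCharacterTextSplitter.from_tiktoken_encoder(...) to count tokens instead.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)
# `documents` is assumed to be a list of LangChain Document objects
chunks = splitter.split_documents(documents)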

Pros: Simple, predictable chunk sizes, works with any text.

Cons: A strict size limit can break sentences and paragraphs at arbitrary boundaries.
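The overlap mechanics are easiest to see in a plain sliding window over tokens. This is a minimal sketch, not LangChain's implementation: `fixed_size_chunks` is a hypothetical helper, and whitespace splitting stands in for a real tokenizer.

```python
def fixed_size_chunks(tokens, chunk_size, overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks

tokens = "one two three four five six seven eight nine ten".split()
chunks = fixed_size_chunks(tokens, chunk_size=4, overlap=1)
```

With a chunk size of 4 and an overlap of 1, each chunk starts on the last token of the previous one, so a sentence cut at a boundary is still partially visible from both sides.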

Semantic Chunking

Groups semantically related sentences together by measuring embedding similarity between consecutive sentences. Chunk boundaries align with topic shifts.

Pros: Chunks are more coherent and self-contained.

Cons: More expensive (requires embedding every sentence), variable chunk sizes.
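The boundary rule can be sketched in a few lines: embed each sentence, compare it to the previous one, and start a new chunk when similarity drops below a threshold. Everything here is an assumption for illustration. `toy_embed` is a bag-of-words stand-in for a real embedding model, and the 0.2 threshold is arbitrary and would need tuning on real data.

```python
import math
from collections import Counter

def toy_embed(sentence):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(sentence.lower().replace(".", "").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk whenever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev_vec = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        if cosine(prev_vec, vec) < threshold:
            chunks.append([sentence])    # topic shift: open a new chunk
        else:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
        prev_vec = vec
    return [" ".join(chunk) for chunk in chunks]

sentences = [
    "Vector databases store embeddings.",
    "Vector databases index embeddings for search.",
    "Chunk overlap preserves context.",
    "Chunk size affects context quality.",
]
chunks = semantic_chunks(sentences)
```

Swapping `toy_embed` for a real model changes only the `embed` argument. The cost noted above comes from that call running once per sentence.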

Recursive Chunking

The RecursiveCharacterTextSplitter in LangChain tries to split on natural boundaries (paragraphs, then sentences, then words) before falling back to character splits. This preserves document structure better than naive fixed-size splitting.
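The idea can be illustrated without the library. The following is a simplified sketch of the recursive strategy, not LangChain's actual implementation: `recursive_split` is a hypothetical function, it has no overlap, and it drops the separator at each chunk boundary.

```python
def recursive_split(text, max_chars, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; recurse into pieces still too long."""
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No natural boundary left: fall back to hard character cuts.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_chars:
            current = piece
        else:
            # This piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, max_chars, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks

text = "First paragraph is short.\n\nSecond paragraph is a bit longer. It has two sentences."
chunks = recursive_split(text, max_chars=40)
```

The first paragraph fits and stays whole. The second exceeds the limit, so it is split again at the sentence boundary rather than mid-word.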

Document-Aware Chunking

For structured documents (Markdown, HTML, code), split at structural boundaries — headings, sections, functions. This preserves the logical units of the document.
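For Markdown, splitting at structural boundaries can be as simple as cutting on heading lines. A minimal sketch, assuming ATX-style headings (`#` through `######`); `split_markdown_by_heading` is a hypothetical helper, and it does not special-case `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def split_markdown_by_heading(markdown):
    """Return one {'heading', 'content'} dict per Markdown section."""
    sections = []
    heading, lines = None, []

    def flush():
        # Emit the section collected so far, skipping an empty preamble.
        if heading is not None or any(line.strip() for line in lines):
            sections.append({"heading": heading, "content": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return sections

doc = """# Install
Run pip install.

## Configure
Set the API key.
Restart the service.
"""
sections = split_markdown_by_heading(doc)
```

Keeping the heading alongside each chunk is useful as metadata: it can be prepended to the chunk text before embedding or used as a filter at query time.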

Chunking Guidelines

Document Type	Recommended Strategy	Chunk Size
Long-form text (articles, reports)	Recursive	400–800 tokens
Technical documentation	Document-aware (by section)	300–600 tokens
Code	Function/class level	Varies
Q&A / FAQ	One chunk per Q&A pair	Varies
Chat logs	By conversation turn	200–400 tokens

Production RAG Pipeline Architecture

Here's what a production-grade RAG system looks like end to end:

┌──────────────────────────────────────────────────────────────┐
│                     DATA INGESTION LAYER                      │
│                                                              │
│  Connectors (S3, APIs, DBs) → Preprocessing → Chunking       │
│       → Embedding (batched) → Vector DB Upsert               │
│       → Change Detection (incremental updates)               │
└──────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────┐
│                      RETRIEVAL LAYER                          │
│                                                              │
│  Query Understanding → Query Expansion/Rewriting              │
│       → Hybrid Search (dense + sparse)                       │
│       → Metadata Filtering                                   │
│       → Re-ranking (cross-encoder)                           │
│       → Context Assembly (dedup, ordering, truncation)       │
└──────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────┐
│                     GENERATION LAYER                          │
│                                                              │
│  Prompt Template → LLM Call → Response Parsing               │
│       → Citation Extraction → Hallucination Check            │
│       → Streaming Response                                   │
└──────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────┐
│                     EVALUATION LAYER                          │
│                                                              │
│  Retrieval Metrics (recall@K, MRR, NDCG)                    │
│  Generation Metrics (faithfulness, relevance, completeness)  │
│  End-to-End Metrics (user satisfaction, task completion)      │
└──────────────────────────────────────────────────────────────┘

For implementation details on evaluation, monitoring, and scaling this architecture, see Building a Production RAG System.
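The retrieval metrics named in the evaluation layer are straightforward to compute once you have labeled query results. The sketch below assumes binary relevance labels and unique document IDs in each result list; the function names and sample IDs are illustrative.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """MRR over a list of (retrieved, relevant) pairs, one pair per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(retrieved, relevant, k):
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

retrieved = ["d3", "d1", "d7", "d2"]  # ranked output for one query
relevant = {"d1", "d2"}               # ground-truth labels for that query

recall = recall_at_k(retrieved, relevant, k=3)
mrr = mean_reciprocal_rank([(retrieved, relevant)])
ndcg = ndcg_at_k(retrieved, relevant, k=4)
```

Recall@K tells you whether the right chunks are reachable at all, while MRR and NDCG reward placing them near the top, which matters when only the first few chunks fit in the prompt.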

Code: Setting Up Chroma with LangChain

Here's a practical example of setting up a persistent Chroma vector store with LangChain, including metadata filtering:

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

documents = [
    Document(
        page_content="RAG combines retrieval with generation for grounded AI responses.",
        metadata={"source": "rag-guide.md", "section": "intro", "year": 2024}
    ),
    Document(
        page_content="Vector databases store embeddings for fast similarity search.",
        metadata={"source": "vector-db.md", "section": "overview", "year": 2024}
    ),
    Document(
        page_content="HNSW is an approximate nearest neighbor algorithm used by most vector DBs.",
        metadata={"source": "vector-db.md", "section": "algorithms", "year": 2023}
    ),
]

vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embedding_model,
    persist_directory="./chroma_db",
    collection_name="rag_docs"
)

# Basic similarity search
results = vectorstore.similarity_search("How do vector databases work?", k=2)
for doc in results:
    print(f"[{doc.metadata['source']}] {doc.page_content[:80]}...")

# Search with metadata filter
results = vectorstore.similarity_search(
    "vector database algorithms",
    k=2,
    filter={"year": 2024}
)

# Search with relevance scores
results = vectorstore.similarity_search_with_relevance_scores(
    "retrieval augmented generation",
    k=3
)
for doc, score in results:
    print(f"Score: {score:.4f} | {doc.page_content[:60]}...")

What's Next

You now understand the architectural components of a RAG system. The next step is putting it all together into a production-grade system with evaluation, monitoring, and scaling — covered in Building a Production RAG System.

For the broader context of how RAG fits into the AI landscape and when to use it versus alternatives, revisit The Complete Guide to RAG.

FAQ

Which vector database should I start with?

For prototyping and learning, use Chroma — it's the simplest to set up (just pip install chromadb) and runs in-process. For production, evaluate Pinecone (managed, zero-ops) or Qdrant/Weaviate (self-hosted, more control) based on your infrastructure preferences and scale requirements.

How do I choose the right chunk size?

Start with 400–600 tokens with 10–15% overlap (on a 500-token chunk, that is 50–75 tokens). Then evaluate: if the right passages aren't reaching the top-K because each chunk mixes several topics, try smaller chunks. If retrieved chunks lack sufficient context, try larger chunks or add surrounding context. There's no universal optimum; it depends on your documents and queries.

Is hybrid search always better than pure vector search?

In most benchmarks, yes — hybrid search combining dense vectors and sparse keyword matching outperforms either alone. The improvement is especially significant for queries with specific keywords, technical terms, or proper nouns. The main tradeoff is that not all vector databases support it natively.

How much does re-ranking improve results?

Cross-encoder re-ranking typically improves precision@5 by 10–30% over bi-encoder similarity alone. The latency cost is modest (50–150ms for re-ranking 20–50 documents) and the quality improvement is consistently one of the best investments in a RAG pipeline.

Can I use multiple embedding models in the same system?

Generally, no — the query and document embeddings must come from the same model to be comparable. However, you can maintain separate vector collections with different embedding models and merge results at the re-ranking stage. This adds complexity and is rarely worth it unless you have very diverse content types.