RAG Architecture Deep Dive: Embeddings, Vector Databases, and Retrieval
Building a RAG prototype takes an afternoon. Building a RAG system that retrieves the right context reliably at scale takes serious architectural decisions. The embedding model, vector database, chunking strategy, and retrieval mechanism each have an outsized impact on the quality of your final output.
This post is a technical deep dive into the components that make or break a RAG system. If you're new to RAG, start with RAG Fundamentals for the basics and code examples, then come back here for the architecture decisions.
RAG Architecture Components
A production RAG system has six key components, each with its own design decisions:
┌─────────────────────────────────────────────────────────────────┐
│ INDEXING PIPELINE (Offline) │
│ │
│ Documents → Loader → Chunker → Embedder → Vector Store │
│ │ │
│ Embedding Model │
│ (OpenAI, Cohere, SBERT) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ QUERY PIPELINE (Online) │
│ │
│ Query → Embedder → Vector Search → Re-ranker → Prompt Builder │
│ │ │
│ LLM Gen │
│ │ │
│ Response │
└─────────────────────────────────────────────────────────────────┘
Let's break down each component.
Embeddings Deep Dive
The embedding model is the most critical choice in your RAG stack. It determines how well your system understands the semantic relationship between queries and documents.
Commercial Embedding Models
OpenAI text-embedding-3-large
Among the strongest commercial options on public retrieval benchmarks. 3072 dimensions, which can be shortened at request time through the API's dimensions parameter. Strong multilingual performance. ~$0.13 per 1M tokens.
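To make the dimension reduction idea concrete, here is a minimal sketch of shortening an embedding by hand: keep the leading values, then rescale to unit length so cosine and dot-product scores stay comparable. The function name and the toy vector are illustrative assumptions, and this only makes sense for models trained to tolerate truncation. With a hosted API you would normally request the smaller size directly instead.

```python
import math


def truncate_and_normalize(embedding: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values, then rescale to unit L2 length."""
    if not 0 < dims <= len(embedding):
        raise ValueError("dims must be between 1 and len(embedding)")
    head = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero vector: nothing to rescale
    return [x / norm for x in head]


# Toy 4-dimensional vector standing in for a real 3072-dimensional embedding.
full = [3.0, 4.0, 12.0, 0.5]
short = truncate_and_normalize(full, 2)
print(short)
```

Skipping the renormalization step is the usual mistake: truncated vectors no longer have unit length, so dot-product scores drift.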
OpenAI text-embedding-3-small
A lighter option at 1536 dimensions. Good balance of quality and cost at ~$0.02 per 1M tokens. Sufficient for most production use cases.
Cohere embed-v3
Competitive with OpenAI, with native support for different input types (search_query vs search_document) that improves retrieval quality. Available in English and multilingual variants.
Open-Source Embedding Models
sentence-transformers/all-MiniLM-L6-v2
The workhorse of open-source embeddings. 384 dimensions, fast inference, good quality for English text. Runs on CPU.
BAAI/bge-large-en-v1.5
Higher quality than MiniLM at 1024 dimensions. Excellent MTEB benchmark scores. Requires a GPU for reasonable throughput.
nomic-ai/nomic-embed-text-v1.5
Strong contender with 768 dimensions. Supports an 8192 token context window — much larger than most alternatives — making it excellent for longer chunks.
Embedding Model Selection Criteria
| Factor | Recommendation |
|---|---|
| Prototyping | all-MiniLM-L6-v2 (free, fast, good enough) |
| Production (cost-sensitive) | OpenAI text-embedding-3-small |
| Production (quality-first) | OpenAI text-embedding-3-large or Cohere embed-v3 |
| Self-hosted / air-gapped | bge-large-en-v1.5 or nomic-embed-text-v1.5 |
| Multilingual | Cohere embed-v3 multilingual or intfloat/multilingual-e5-large |
Vector Database Comparison
Vector databases store your embeddings and provide fast similarity search. The landscape has matured significantly, with options ranging from in-process libraries to fully managed cloud services.
| Feature | FAISS | Chroma | Pinecone | Weaviate | Qdrant |
|---|---|---|---|---|---|
| Type | Library | Embedded/Client-Server | Managed cloud | Self-hosted + Cloud | Self-hosted + Cloud |
| Setup | pip install | pip install | API key | Docker or Cloud | Docker or Cloud |
| Persistence | Manual (save/load) | Built-in | Managed | Built-in | Built-in |
| Metadata filtering | None | Yes | Yes | Yes (GraphQL) | Yes (rich filters) |
| Max vectors | Billions (RAM) | Millions | Billions | Billions | Billions |
| ANN algorithm | IVF, HNSW, PQ | HNSW | Proprietary | HNSW | HNSW |
| Hybrid search | No | No | Yes (sparse+dense) | Yes (BM25+vector) | Yes (sparse+dense) |
| Pricing | Free (OSS) | Free (OSS) | From $70/mo | Free (OSS) / Cloud | Free (OSS) / Cloud |
| Best for | Research, prototypes | Local dev, small apps | Managed production | Full-featured self-host | Performance-critical |
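When comparing the "Max vectors" row, it helps to estimate how much memory the raw vectors need. The sketch below assumes uncompressed float32 values (4 bytes each) and ignores index overhead such as HNSW graph links, metadata, and replication, so treat the result as a lower bound. The 10 million vector corpus is an arbitrary example.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold the vectors alone (float32 by default)."""
    return num_vectors * dims * bytes_per_value


def to_gib(num_bytes: int) -> float:
    return num_bytes / 1024**3


# Dimensions taken from the embedding models discussed earlier in this post.
models = [
    ("all-MiniLM-L6-v2", 384),
    ("text-embedding-3-small", 1536),
    ("text-embedding-3-large", 3072),
]

for name, dims in models:
    gib = to_gib(raw_vector_bytes(10_000_000, dims))
    print(f"{name:<24} {dims:>5} dims  {gib:7.1f} GiB")
```

For the same corpus, the 3072-dimension model needs eight times the memory of the 384-dimension one, which is often what decides between an in-memory library and a managed or disk-backed service.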
When to Use Each
FAISS — Best for research and prototyping when you need maximum flexibility and don't need persistence or metadata filtering. Built by Meta, battle-tested at scale.
Chroma — The easiest way to get started. Runs in-process with Python, has a good API, and handles persistence. Ideal for local development and smaller production workloads.
Pinecone — Best fully managed option. Zero infrastructure management, strong performance, and serverless pricing. Choose this when you don't want to operate infrastructure.
Weaviate — Feature-rich with built-in vectorization, GraphQL API, and multi-tenancy. Strong choice for teams that want a self-hosted solution with enterprise features.
Qdrant — Excellent performance, rich filtering, and a clean API. Written in Rust with strong focus on speed. Growing quickly in the ecosystem.
Retrieval Mechanisms
How you search your vector store matters as much as what you store in it.
Dense Retrieval
The standard approach: embed the query, find the K nearest vectors by cosine similarity or dot product. Works well when your embedding model captures the semantic relationship between queries and documents.
# Dense retrieval with Chroma
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5
)
Strengths: Understands meaning, handles paraphrases well. Weaknesses: Can miss exact keyword matches, struggles with rare terms and proper nouns.
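Under the hood, dense retrieval is a nearest-neighbor search. Here is a brute-force sketch in plain Python, using toy 3-dimensional vectors in place of real embeddings. The document IDs and vectors are made up for illustration. Vector databases replace the exhaustive loop with an approximate index such as HNSW, but the scoring idea is the same.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Score every document against the query and keep the k best."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors; a real system would get these from the embedding model.
doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
hits = top_k([1.0, 0.1, 0.0], doc_vecs, k=2)
for doc_id, score in hits:
    print(f"{doc_id}: {score:.3f}")
```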
Sparse Retrieval
Traditional keyword-based search using algorithms like BM25 or TF-IDF. Represents documents as sparse vectors based on term frequency.
Strengths: Precise keyword matching, handles rare terms well, fast. Weaknesses: No semantic understanding; "car" and "automobile" are treated as unrelated terms.
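A compact BM25 scorer shows where both the strengths and the weaknesses come from. This is a sketch rather than a production implementation: it uses whitespace tokenization, a commonly used smoothed IDF, and the frequently cited parameter values k1=1.5 and b=0.75, all of which are assumptions you would tune or replace with a proper search library.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no exact match, no contribution
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "the car needs a new engine",
    "the automobile needs a new engine",
    "vector databases store embeddings",
]
scores = bm25_scores("car engine", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

The second document only earns credit for "engine": the scorer has no way to know that "automobile" means "car".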
Hybrid Search
Combines dense and sparse retrieval to get the best of both worlds. The query runs through both a vector search and a keyword search, and results are merged using reciprocal rank fusion (RRF) or a learned combination.
# Hybrid search with Weaviate (v3 Python client query-builder syntax)
results = (
    client.query.get("Document", ["content"])
    .with_hybrid(query="RAG architecture patterns", alpha=0.75)
    .with_limit(5)
    .do()
)
# alpha=1.0 → pure vector search
# alpha=0.0 → pure keyword search
# alpha=0.75 → weighted toward vector (recommended starting point)
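In most published benchmarks, hybrid search outperforms either approach alone, with the largest gains on queries that contain specific keywords, technical terms, or proper nouns. If your vector database supports it, use it.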
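If your database does not merge results for you, reciprocal rank fusion is easy to apply yourself. The sketch below assumes you already have two ranked lists of document IDs, one from dense search and one from keyword search. The IDs are made up, and k=60 is the constant commonly used with RRF, not a tuned value. It is a generic illustration and not how any particular database implements its fusion.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Merge ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_ranking = ["doc_a", "doc_b", "doc_c"]   # from vector search
sparse_ranking = ["doc_c", "doc_a", "doc_d"]  # from keyword search
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Because RRF only looks at ranks, it sidesteps the problem that cosine similarities and BM25 scores live on different scales.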
Re-Ranking Strategies
The initial retrieval step is optimized for speed (searching millions of vectors in milliseconds). Re-ranking adds a second, more accurate scoring pass on just the top-K results.
Cross-Encoder Re-Rankers
Cross-encoders process the query and each document together through a transformer, producing a relevance score. This is more accurate than bi-encoder similarity (used in initial retrieval) but much slower — which is why you re-rank only the top 20–50 results, not the entire corpus.
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
query = "How does RAG reduce hallucinations?"
# retrieved_docs: LangChain Document objects returned by the initial retrieval step
documents = [doc.page_content for doc in retrieved_docs]
# Score each query-document pair
pairs = [[query, doc] for doc in documents]
scores = reranker.predict(pairs)
# Re-sort by relevance score
ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
top_docs = [doc for doc, score in ranked[:5]]
Cohere Rerank
Cohere offers a managed re-ranking API that's easy to integrate:
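import cohere
co = cohere.Client("your-api-key")
# retrieved_texts: list of chunk strings from the initial retrieval step
results = co.rerank(
    query="How does RAG reduce hallucinations?",
    documents=retrieved_texts,
    model="rerank-english-v3.0",
    top_n=5
)
for result in results.results:
    # result.index points back into retrieved_texts, so this works whether or not
    # the response echoes the document text
    text = retrieved_texts[result.index]
    print(f"Score: {result.relevance_score:.4f} | {text[:80]}...")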
Re-ranking is one of the highest-ROI improvements you can make to a RAG system. It typically improves retrieval precision by 10–30% with minimal latency cost (50–100ms for 20 documents).
Chunking Strategies
How you split documents into chunks directly impacts retrieval quality. There's no universal best strategy — it depends on your document structure and query patterns.
Fixed-Size Chunking
Split text into chunks of a fixed token or character count with overlap.
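from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,      # measured in characters by default, not tokens
    chunk_overlap=50,    # characters shared between consecutive chunks
    separators=["\n\n", "\n", ". ", " ", ""]
)
# documents: a list of LangChain Document objects loaded earlier
chunks = splitter.split_documents(documents)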
Pros: Simple, predictable chunk sizes, works with any text. Cons: Breaks sentences and paragraphs at arbitrary boundaries.
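Stripped of library helpers, fixed-size chunking with overlap is a sliding window. In this sketch, whitespace-separated words stand in for tokens, which is an assumption: real token counts depend on the tokenizer of your embedding model.

```python
def fixed_size_chunks(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


words = "one two three four five six seven eight nine ten".split()
chunks = fixed_size_chunks(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk starts with the last word of the previous one, which is the overlap doing its job: a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.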
Semantic Chunking
Groups semantically related sentences together by measuring embedding similarity between consecutive sentences. Chunk boundaries align with topic shifts.
Pros: Chunks are more coherent and self-contained. Cons: More expensive (requires embedding every sentence), variable chunk sizes.
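One simple way to implement this is to start a new chunk whenever the similarity between consecutive sentences drops below a threshold. The sketch below takes the embedding function as a parameter. The `toy_embed` function and the 0.5 threshold are stand-ins invented for this example so that it runs without a model. In practice you would pass your real embedder and tune the threshold on your own documents.

```python
import math
from typing import Callable

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: list[str],
    embed: Callable[[str], Vector],
    threshold: float = 0.5,
) -> list[list[str]]:
    """Group consecutive sentences; break where similarity falls below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev_vec = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        if cosine(prev_vec, vec) < threshold:
            chunks.append([sentence])    # topic shift: start a new chunk
        else:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
        prev_vec = vec
    return chunks


def toy_embed(sentence: str) -> Vector:
    """Hypothetical 2-d 'embedding': counts of search words vs cooking words."""
    words = sentence.lower().split()
    return [
        float(sum(w in {"vector", "index", "embedding"} for w in words)),
        float(sum(w in {"pasta", "sauce", "oven"} for w in words)),
    ]


sentences = [
    "The vector index stores each embedding",
    "Each embedding lives in the vector index",
    "Boil the pasta before adding sauce",
    "Then finish the pasta in the oven",
]
groups = semantic_chunks(sentences, toy_embed)
for group in groups:
    print(group)
```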
Recursive Chunking
The RecursiveCharacterTextSplitter in LangChain tries to split on natural boundaries (paragraphs, then sentences, then words) before falling back to character splits. This preserves document structure better than naive fixed-size splitting.
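The recursive idea itself fits in a few lines. This is a simplified sketch of the technique, not LangChain's implementation: it has no overlap, it drops the separator at each cut, and the separator list and `max_chars` limit are assumptions chosen for the example.

```python
def recursive_split(
    text: str,
    max_chars: int,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split on the coarsest separator first; recurse only into oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No natural boundary left: fall back to hard character cuts.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        return recursive_split(text, max_chars, finer)

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks


text = (
    "Intro paragraph one.\n\n"
    "Second paragraph is a little longer. It has two sentences.\n\n"
    "End."
)
chunks = recursive_split(text, max_chars=40)
for chunk in chunks:
    print(repr(chunk))
```

Short paragraphs survive intact, and only the paragraph that exceeds the limit is broken down further, at its sentence boundary.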
Document-Aware Chunking
For structured documents (Markdown, HTML, code), split at structural boundaries — headings, sections, functions. This preserves the logical units of the document.
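For Markdown, a heading-based splitter can be sketched with a single regular expression. The function name and output shape here are illustrative assumptions. The sketch also has a known limitation: it would treat a line starting with `#` inside a fenced code block as a heading, so a real pipeline should use a Markdown parser or track fence state.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(markdown: str) -> list[dict]:
    """Return one {'title', 'text'} section per heading."""
    sections = []
    title, lines = "", []

    def flush() -> None:
        # Skip the empty preamble before the first heading.
        if title or any(line.strip() for line in lines):
            sections.append({"title": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return sections


doc = """# Setup
Install the package.

## Configuration
Set the API key.
Choose a region.
"""
sections = split_markdown_by_heading(doc)
for section in sections:
    print(section)
```

Storing the heading as the section title also gives you useful metadata for filtering and for prepending context to each chunk before embedding.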
Chunking Guidelines
| Document Type | Recommended Strategy | Chunk Size |
|---|---|---|
| Long-form text (articles, reports) | Recursive | 400–800 tokens |
| Technical documentation | Document-aware (by section) | 300–600 tokens |
| Code | Function/class level | Varies |
| Q&A / FAQ | One chunk per Q&A pair | Varies |
| Chat logs | By conversation turn | 200–400 tokens |
Production RAG Pipeline Architecture
Here's what a production-grade RAG system looks like end to end:
┌──────────────────────────────────────────────────────────────┐
│ DATA INGESTION LAYER │
│ │
│ Connectors (S3, APIs, DBs) → Preprocessing → Chunking │
│ → Embedding (batched) → Vector DB Upsert │
│ → Change Detection (incremental updates) │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ RETRIEVAL LAYER │
│ │
│ Query Understanding → Query Expansion/Rewriting │
│ → Hybrid Search (dense + sparse) │
│ → Metadata Filtering │
│ → Re-ranking (cross-encoder) │
│ → Context Assembly (dedup, ordering, truncation) │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ GENERATION LAYER │
│ │
│ Prompt Template → LLM Call → Response Parsing │
│ → Citation Extraction → Hallucination Check │
│ → Streaming Response │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ EVALUATION LAYER │
│ │
│ Retrieval Metrics (recall@K, MRR, NDCG) │
│ Generation Metrics (faithfulness, relevance, completeness) │
│ End-to-End Metrics (user satisfaction, task completion) │
└──────────────────────────────────────────────────────────────┘
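The Context Assembly step in the retrieval layer (dedup, ordering, truncation) is easy to overlook, so here is one way it might look. The character budget, the exact-match dedup on normalized text, and the sample chunks are all assumptions made for this sketch. A production system would more likely budget in tokens and detect near-duplicates.

```python
def assemble_context(ranked_chunks: list[tuple[str, float]], max_chars: int) -> list[str]:
    """Order by score, drop duplicates, and stop adding chunks that exceed the budget."""
    seen: set[str] = set()
    selected: list[str] = []
    used = 0
    for text, _score in sorted(ranked_chunks, key=lambda c: c[1], reverse=True):
        key = " ".join(text.lower().split())  # normalize case and whitespace
        if key in seen:
            continue  # duplicate of a higher-scored chunk
        if used + len(text) > max_chars:
            continue  # does not fit; a shorter chunk further down still might
        seen.add(key)
        selected.append(text)
        used += len(text)
    return selected


ranked_chunks = [
    ("HNSW is a graph-based ANN index.", 0.75),
    ("RAG grounds answers in retrieved text.", 0.91),
    ("rag grounds answers in  retrieved text.", 0.88),  # duplicate after normalizing
    ("A very long chunk " * 10, 0.60),                  # too large for the budget
]
context = assemble_context(ranked_chunks, max_chars=80)
for chunk in context:
    print(chunk)
```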
For implementation details on evaluation, monitoring, and scaling this architecture, see Building a Production RAG System.
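As a quick reference, the retrieval metrics named in the evaluation layer can be computed in a few lines. This sketch assumes binary relevance (a document is either relevant or not) and uses made-up document IDs. MRR is the mean of the reciprocal rank over a set of queries.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain divided by the best possible gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(f"recall@3 = {recall_at_k(retrieved, relevant, 3):.2f}")
print(f"RR       = {reciprocal_rank(retrieved, relevant):.2f}")
print(f"NDCG@4   = {ndcg_at_k(retrieved, relevant, 4):.3f}")
```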
Code: Setting Up Chroma with LangChain
Here's a practical example of setting up a persistent Chroma vector store with LangChain, including metadata filtering:
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
documents = [
    Document(
        page_content="RAG combines retrieval with generation for grounded AI responses.",
        metadata={"source": "rag-guide.md", "section": "intro", "year": 2024}
    ),
    Document(
        page_content="Vector databases store embeddings for fast similarity search.",
        metadata={"source": "vector-db.md", "section": "overview", "year": 2024}
    ),
    Document(
        page_content="HNSW is an approximate nearest neighbor algorithm used by most vector DBs.",
        metadata={"source": "vector-db.md", "section": "algorithms", "year": 2023}
    ),
]
# Embed the documents and persist the collection to disk
vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embedding_model,
    persist_directory="./chroma_db",
    collection_name="rag_docs"
)
# Basic similarity search
results = vectorstore.similarity_search("How do vector databases work?", k=2)
for doc in results:
    print(f"[{doc.metadata['source']}] {doc.page_content[:80]}...")
# Search with metadata filter
results = vectorstore.similarity_search(
    "vector database algorithms",
    k=2,
    filter={"year": 2024}
)
# Search with relevance scores
results = vectorstore.similarity_search_with_relevance_scores(
    "retrieval augmented generation",
    k=3
)
for doc, score in results:
    print(f"Score: {score:.4f} | {doc.page_content[:60]}...")
What's Next
You now understand the architectural components of a RAG system. The next step is putting it all together into a production-grade system with evaluation, monitoring, and scaling — covered in Building a Production RAG System.
For the broader context of how RAG fits into the AI landscape and when to use it versus alternatives, revisit The Complete Guide to RAG.
FAQ
Which vector database should I start with?
For prototyping and learning, use Chroma — it's the simplest to set up (just pip install chromadb) and runs in-process. For production, evaluate Pinecone (managed, zero-ops) or Qdrant/Weaviate (self-hosted, more control) based on your infrastructure preferences and scale requirements.
How do I choose the right chunk size?
Start with 400–600 tokens with 10–15% overlap. Then evaluate: if the top-K results are dominated by loosely related text (low precision), try smaller chunks. If retrieved chunks lack sufficient context, try larger chunks or add surrounding context. There's no universal optimum; it depends on your documents and queries.
Is hybrid search always better than pure vector search?
In most benchmarks, yes — hybrid search combining dense vectors and sparse keyword matching outperforms either alone. The improvement is especially significant for queries with specific keywords, technical terms, or proper nouns. The main tradeoff is that not all vector databases support it natively.
How much does re-ranking improve results?
Cross-encoder re-ranking typically improves precision@5 by 10–30% over bi-encoder similarity alone. The latency cost is modest (50–150ms for re-ranking 20–50 documents) and the quality improvement is consistently one of the best investments in a RAG pipeline.
Can I use multiple embedding models in the same system?
Generally, no — the query and document embeddings must come from the same model to be comparable. However, you can maintain separate vector collections with different embedding models and merge results at the re-ranking stage. This adds complexity and is rarely worth it unless you have very diverse content types.