Agentic RAG: Multi-Agent Systems, Planning, and Tool Integration
Standard RAG follows a rigid retrieve-then-generate pipeline: embed a query, fetch documents, stuff them into a prompt, and generate a response. It works well for straightforward factual lookups. But the moment a question requires reasoning across multiple sources, deciding which tool to use, or breaking a complex task into sub-steps, vanilla RAG falls apart.
Agentic RAG fixes this by placing an autonomous agent at the center of the retrieval pipeline. The agent doesn't just retrieve — it reasons about what information it needs, plans a strategy, executes retrieval actions (and other tools), evaluates results, and iterates until it has a satisfactory answer.
If you're new to retrieval-augmented generation, start with the complete guide to RAG before diving into agent architectures.
What Makes RAG "Agentic"
In standard RAG, the flow is deterministic:
- User query arrives
- Query gets embedded
- Vector search retrieves top-k chunks
- LLM generates response from retrieved context
Agentic RAG replaces this fixed pipeline with a dynamic decision loop. The agent:
- Decides whether retrieval is needed — some queries can be answered from the conversation or the model's parametric knowledge.
- Selects which retrieval source to query — vector store, web search, SQL database, or API.
- Reformulates the query — rephrasing for better recall before hitting the retriever.
- Evaluates retrieval results — determines if the fetched documents actually answer the question.
- Iterates — if results are insufficient, the agent retries with a different strategy.
The key difference: the LLM is no longer a passive consumer of retrieved context. It's an active participant in the retrieval process itself.
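As a rough illustration of the decision loop above, here is a standard-library sketch. The source names, the `needs_retrieval` and `is_sufficient` heuristics, and both stub retrievers are hypothetical placeholders; in a real agent an LLM would make each of these decisions.

```python
from typing import Callable

# Hypothetical stub retrievers standing in for a vector store and a web search API.
def vector_search(query: str) -> list[str]:
    corpus = {"refund policy": ["Refunds are issued within 14 days."]}
    return [doc for key, docs in corpus.items() if key in query.lower() for doc in docs]

def web_search(query: str) -> list[str]:
    return [f"(web result for: {query})"]

SOURCES: dict[str, Callable[[str], list[str]]] = {
    "vector_store": vector_search,
    "web": web_search,
}

def needs_retrieval(query: str) -> bool:
    # Placeholder heuristic; an LLM call would normally make this decision.
    return query.lower().strip() not in {"hi", "hello", "thanks"}

def is_sufficient(docs: list[str]) -> bool:
    # Placeholder check; a real agent would grade relevance with the LLM.
    return len(docs) > 0

def agentic_retrieve(query: str, source_order=("vector_store", "web")) -> tuple[str, list[str]]:
    """Try each source in turn until one returns sufficient results."""
    if not needs_retrieval(query):
        return ("none", [])
    for name in source_order:
        docs = SOURCES[name](query)
        if is_sufficient(docs):
            return (name, docs)
    return ("exhausted", [])

print(agentic_retrieve("What is the refund policy?"))
print(agentic_retrieve("Latest GPU prices"))
```

The first call is answered from the stub vector store; the second finds nothing there and falls through to the web stub, which is the "iterate with a different strategy" step in miniature.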
The Agent Loop: Observe → Think → Act → Retrieve
Most agentic RAG systems follow the same core loop, regardless of implementation framework:
Observe
The agent receives the current state: the user's query, conversation history, previously retrieved documents, and tool outputs. This observation phase grounds the agent in what it knows so far.
Think
The agent reasons about what to do next. This is where chain-of-thought planning happens. The agent might decide it needs to search for a specific subtopic, verify a claim, or combine information from two different sources.
Act
Based on its reasoning, the agent selects and executes an action — calling a retrieval tool, running a web search, executing code, or generating a final response.
Retrieve
When the chosen action is retrieval, the agent formulates a query, sends it to the appropriate knowledge source, and receives results that feed back into the observation phase. The loop continues until the agent decides it has enough information to generate a complete answer.
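The four phases can be sketched as a small loop over an explicit state object. This is a minimal sketch, not a framework API: `AgentState`, `think`, and `retrieve` are hypothetical names, and the `think` policy is hard-coded where a real agent would call an LLM.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    query: str
    observations: list[str] = field(default_factory=list)
    steps: int = 0

def think(state: AgentState) -> tuple[str, str]:
    """Hypothetical policy returning (action, argument); an LLM would decide this."""
    if not state.observations:
        return ("retrieve", state.query)
    return ("respond", f"Answer based on {len(state.observations)} observation(s).")

def retrieve(query: str) -> str:
    # Stub retriever; replace with a real knowledge-source call.
    return f"[chunk about: {query}]"

def run_agent(query: str, max_steps: int = 5) -> str:
    state = AgentState(query=query)
    while state.steps < max_steps:
        state.steps += 1
        action, arg = think(state)                    # Think, over the observed state
        if action == "retrieve":                      # Act: call the retrieval tool
            state.observations.append(retrieve(arg))  # result feeds the next Observe
        else:
            return arg                                # Act: final response
    return "Stopped: step limit reached."

print(run_agent("How does HNSW indexing work?"))
```

The step limit is there so that a policy which never chooses to respond still terminates.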
The ReAct Pattern
ReAct (Reasoning + Acting) is the foundational pattern for agentic RAG. Introduced by Yao et al. (2022), it interleaves reasoning traces with concrete actions, giving the LLM a structured way to think through multi-step problems.
A ReAct trace looks like this:
Question: What were the key architectural changes between GPT-3 and GPT-4,
and how did they affect performance on reasoning benchmarks?
Thought: This question requires information about two models. I should first
search for GPT-3 architecture details, then GPT-4, then compare benchmark results.
Action: search_knowledge_base("GPT-3 architecture transformer details")
Observation: [Retrieved chunks about GPT-3's 175B parameter dense transformer...]
Thought: I have GPT-3 details. Now I need GPT-4 architecture information.
The knowledge base may have the technical report.
Action: search_knowledge_base("GPT-4 architecture changes mixture of experts")
Observation: [Retrieved chunks about GPT-4's rumored MoE architecture...]
Thought: I have architectural details for both. Now I need benchmark comparisons
to complete the answer.
Action: search_knowledge_base("GPT-3 vs GPT-4 reasoning benchmark results")
Observation: [Retrieved chunks with MMLU, HellaSwag, ARC scores...]
Thought: I now have enough information to provide a comprehensive comparison.
Action: generate_response
Each step is explicitly traced, making the system debuggable and auditable — critical for production RAG deployments.
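One way to turn a text trace like this into executed tool calls is to parse each Action line and dispatch it to a registered function. The sketch below assumes the exact `Action: tool_name("argument")` format shown in the trace; the regular expression, the `TOOLS` registry, and the stub tool are illustrative, and frameworks with native function calling do not need this parsing step.

```python
import re

# Matches: Action: tool_name("argument")  or  Action: tool_name
ACTION_RE = re.compile(r'^Action:\s*(\w+)(?:\("(.*)"\))?\s*$')

def parse_action(line: str):
    """Return (tool_name, argument) for an 'Action:' line, or None for other lines."""
    match = ACTION_RE.match(line.strip())
    if match is None:
        return None
    return (match.group(1), match.group(2))

def search_knowledge_base(query: str) -> str:
    # Stub tool; a real one would query a vector store.
    return f"[Retrieved chunks for: {query}]"

TOOLS = {"search_knowledge_base": search_knowledge_base}

def step(llm_output: str) -> str:
    """Execute the first Action found in the LLM output and format the Observation."""
    for line in llm_output.splitlines():
        parsed = parse_action(line)
        if parsed is None:
            continue
        name, arg = parsed
        if name == "generate_response":
            return "FINAL"
        if name not in TOOLS:
            return f"Observation: unknown tool '{name}'"
        return f"Observation: {TOOLS[name](arg)}"
    return "Observation: no action found"

print(step('Thought: I need GPT-3 details.\nAction: search_knowledge_base("GPT-3 architecture")'))
```

Returning an Observation for unknown tools, rather than raising, lets the model see its mistake and correct it on the next turn.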
Chain-of-Thought Planning in RAG Agents
Chain-of-thought (CoT) planning elevates agentic RAG beyond reactive tool-calling. Instead of simply deciding the next action, the agent constructs a multi-step plan before executing anything.
Query Decomposition
Complex queries get broken into sub-queries that can each be answered independently:
Original: "Compare the cost-effectiveness of Pinecone vs Weaviate
for a 10M document RAG system with 50 QPS"
Plan:
1. Retrieve Pinecone pricing for 10M vectors
2. Retrieve Weaviate self-hosted infrastructure costs
3. Retrieve benchmark data for both at 50 QPS
4. Retrieve any case studies comparing the two at scale
5. Synthesize comparison with cost analysis
Adaptive Replanning
Plans aren't static. If step 2 returns insufficient information about Weaviate costs at scale, the agent replans:
Revised step 2a: Search for Weaviate cloud pricing tiers
Revised step 2b: Search for Weaviate self-hosted AWS infrastructure guides
This adaptability is what separates agentic RAG from simple query decomposition pipelines.
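A plan with replanning can be modeled as a queue in which an unproductive step is swapped for revised sub-steps. In this sketch the `FALLBACKS` table and the `execute` stub are hypothetical stand-ins: a real agent would ask the LLM to generate the revised steps and would run actual retrievals.

```python
from collections import deque

# Hypothetical revised sub-steps; an LLM would normally generate these on the fly.
FALLBACKS = {
    "Retrieve Weaviate self-hosted infrastructure costs": [
        "Search for Weaviate cloud pricing tiers",
        "Search for Weaviate self-hosted AWS infrastructure guides",
    ],
}

def execute(step: str) -> list[str]:
    # Stub executor: pretend the self-hosted cost query finds nothing.
    if step == "Retrieve Weaviate self-hosted infrastructure costs":
        return []
    return [f"result for: {step}"]

def run_plan(plan: list[str]) -> list[str]:
    """Run steps in order; swap an insufficient step for its revised sub-steps."""
    queue = deque(plan)
    completed = []
    while queue:
        step = queue.popleft()
        if execute(step):
            completed.append(step)
        elif step in FALLBACKS:
            # Replan: push the revised sub-steps to the front, preserving their order.
            queue.extendleft(reversed(FALLBACKS[step]))
        # Steps with no results and no fallback are dropped in this sketch.
    return completed

plan = [
    "Retrieve Pinecone pricing for 10M vectors",
    "Retrieve Weaviate self-hosted infrastructure costs",
    "Retrieve benchmark data for both at 50 QPS",
]
for done in run_plan(plan):
    print(done)
```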
Memory Systems
Agentic RAG systems require multiple layers of memory to function effectively across interactions.
Short-Term Memory (Conversation Buffer)
The immediate conversation context: the user's messages, the agent's responses, and the tool calls made during the current session. Many frameworks store this as a sliding window of the last N messages, so older turns drop out of context unless they are summarized or moved to long-term memory.
Working Memory (Scratchpad)
Intermediate reasoning state that persists during a single task execution. When the agent decomposes a query into five sub-questions, the answers to sub-questions 1 through 3 live in working memory while it tackles sub-question 4.
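The first two layers need nothing more than standard-library containers. The sketch below is illustrative only: `AgentMemory` and its method names are hypothetical, not part of any framework.

```python
from collections import deque

class AgentMemory:
    """Short-term window plus a per-task scratchpad (illustrative names)."""

    def __init__(self, window_size: int = 4):
        # Short-term memory: only the last `window_size` messages are kept.
        self.messages = deque(maxlen=window_size)
        # Working memory: sub-question -> answer for the task in progress.
        self.scratchpad: dict[str, str] = {}

    def add_message(self, role: str, content: str) -> None:
        self.messages.append((role, content))

    def record(self, sub_question: str, answer: str) -> None:
        self.scratchpad[sub_question] = answer

    def pending(self, sub_questions: list[str]) -> list[str]:
        """Sub-questions that have not been answered yet."""
        return [q for q in sub_questions if q not in self.scratchpad]

    def finish_task(self) -> None:
        # Working memory does not outlive the task.
        self.scratchpad.clear()

memory = AgentMemory(window_size=2)
for i in range(3):
    memory.add_message("user", f"message {i}")
memory.record("q1", "answer 1")
print(list(memory.messages))          # only the two most recent messages remain
print(memory.pending(["q1", "q2"]))
```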
Long-Term Memory (Persistent Knowledge)
Information that persists across sessions. This can include:
- User preferences: "This user always asks about Python implementations"
- Learned facts: corrections or clarifications from previous conversations
- Retrieval patterns: which knowledge sources worked well for which query types
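from langchain.memory import ConversationBufferWindowMemory, VectorStoreRetrieverMemory

# Short-term memory: keep only the most recent 10 conversation turns.
short_term = ConversationBufferWindowMemory(k=10, return_messages=True)

# Long-term memory: `vectorstore` is assumed to be an existing LangChain vector store.
long_term = VectorStoreRetrieverMemory(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    memory_key="long_term_context",
)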
Long-term memory lets the agent improve with use. Previous interactions become retrievable knowledge, creating a feedback loop in which the agent draws on its own history: earlier corrections, stated preferences, and the sources that worked.
Tool Integration
Retrieval from a vector store is just one tool available to an agentic RAG system. Real-world agents combine multiple tools to answer complex queries.
Common Tool Categories
| Tool Type | Examples | Use Case |
|---|---|---|
| Knowledge retrieval | Vector search, BM25, SQL queries | Structured and unstructured knowledge lookup |
| Web search | Tavily, SerpAPI, Brave Search | Current events, recent information |
| Code execution | Python REPL, sandboxed environments | Calculations, data analysis, visualization |
| API calls | Weather, finance, CRM systems | Real-time external data |
| Document generation | PDF creation, chart rendering | Producing artifacts for the user |
Tool Selection
The agent must decide which tool to use for each sub-task. This decision is driven largely by the tool names and descriptions the model sees, whether they are written into the system prompt or passed as function-calling schemas:
from langchain.tools import Tool

# technical_retriever, tavily_search, and python_repl are assumed to be defined elsewhere.
tools = [
    Tool(
        name="technical_docs_search",
        func=technical_retriever.invoke,
        description="Search internal technical documentation. Use for questions "
        "about architecture, APIs, and implementation details.",
    ),
    Tool(
        name="web_search",
        func=tavily_search.invoke,
        description="Search the web for recent information. Use when the knowledge "
        "base might be outdated or when the question involves current events.",
    ),
    Tool(
        name="python_repl",
        func=python_repl.run,
        description="Execute Python code. Use for calculations, data analysis, "
        "or when you need to process/transform retrieved data.",
    ),
]
Clear, specific tool descriptions are critical. Vague descriptions lead to the agent choosing the wrong tool, which compounds errors across the agent loop.
Multi-Agent Architectures
Single-agent systems hit a ceiling when tasks require deep specialization across domains. Multi-agent architectures solve this by assigning different agents to different roles.
Orchestrator Pattern
A central orchestrator agent receives the user query, plans the approach, and delegates sub-tasks to specialist agents:
Orchestrator Agent
├── Retrieval Agent (manages vector stores, handles query optimization)
├── Analysis Agent (processes retrieved data, performs calculations)
├── Web Research Agent (searches the web, validates freshness)
└── Response Agent (synthesizes outputs, handles formatting)
Each specialist agent has its own system prompt, tool set, and potentially its own model (a cheaper model for simple retrieval, a more capable model for analysis).
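Stripped of the LLM calls, the orchestrator pattern is a planner plus a dispatch table. In this sketch every agent is reduced to a plain function, and `plan`, `SPECIALISTS`, and the agent names are hypothetical; a real orchestrator would ask a model to produce the (specialist, sub-task) pairs.

```python
from typing import Callable

# Hypothetical specialist agents, each reduced to a plain function.
def retrieval_agent(task: str) -> str:
    return f"docs for '{task}'"

def analysis_agent(task: str) -> str:
    return f"analysis of '{task}'"

def web_research_agent(task: str) -> str:
    return f"web findings for '{task}'"

SPECIALISTS: dict[str, Callable[[str], str]] = {
    "retrieval": retrieval_agent,
    "analysis": analysis_agent,
    "web_research": web_research_agent,
}

def plan(query: str) -> list[tuple[str, str]]:
    """Stub planner: a real orchestrator would ask an LLM for these pairs."""
    return [("retrieval", query), ("analysis", query)]

def orchestrate(query: str) -> str:
    outputs = [SPECIALISTS[name](task) for name, task in plan(query)]
    # Response-agent role: merge the specialist outputs into one answer.
    return " | ".join(outputs)

print(orchestrate("Pinecone pricing"))
```

Because each specialist is just a callable behind a name, swapping in a cheaper model for one role changes a single dictionary entry.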
Supervisor Pattern
Similar to the orchestrator, but the supervisor monitors agent outputs and can intervene — rejecting low-quality responses, requesting retries, or reassigning tasks.
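The supervisor's review step can be sketched as a retry loop around a worker, with an acceptance check in between. Everything here is illustrative: `supervise`, `has_citation`, and `drafting_worker` are hypothetical names, and the acceptance rule (the answer must contain a source tag) is just one simple example of a quality gate.

```python
from typing import Callable, Optional

def supervise(
    task: str,
    worker: Callable[[str, Optional[str]], str],
    accept: Callable[[str], bool],
    max_retries: int = 2,
) -> dict:
    """Run `worker`, retrying with feedback until `accept` passes or retries run out."""
    feedback: Optional[str] = None
    output = ""
    attempts = 0
    for attempts in range(1, max_retries + 2):
        output = worker(task, feedback)
        if accept(output):
            return {"output": output, "attempts": attempts, "accepted": True}
        # Hypothetical feedback message; a real supervisor would generate this.
        feedback = "Rejected: the answer must cite a source."
    return {"output": output, "attempts": attempts, "accepted": False}

def has_citation(output: str) -> bool:
    return "[Source:" in output

def drafting_worker(task: str, feedback: Optional[str]) -> str:
    # Stub worker that only adds a citation after being told to.
    if feedback is None:
        return f"Answer to '{task}'"
    return f"Answer to '{task}' [Source: docs/pricing.md]"

result = supervise("Summarize pricing", drafting_worker, has_citation)
print(result)
```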
Peer-to-Peer Pattern
Agents communicate directly with each other without a central coordinator. Agent A retrieves documents and passes them to Agent B for analysis; Agent B can ask Agent A for additional context. This suits well-defined workflows, but without a coordinator it becomes hard to control for open-ended tasks.
Building a ReAct RAG Agent with LangGraph
Here's a practical implementation of a ReAct RAG agent using LangGraph, which provides fine-grained control over the agent loop:
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.tools import Tool
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.load_local(
    "knowledge_base",
    embeddings,
    allow_dangerous_deserialization=True,
)


def search_knowledge_base(query: str) -> str:
    """Search the internal knowledge base for relevant information."""
    docs = vectorstore.similarity_search(query, k=4)
    return "\n\n".join(
        f"[Source: {d.metadata.get('source', 'unknown')}]\n{d.page_content}"
        for d in docs
    )


tools = [
    Tool(
        name="search_knowledge_base",
        func=search_knowledge_base,
        description="Search the internal documentation knowledge base. "
        "Returns relevant passages with source attribution.",
    ),
]

model = ChatOpenAI(model="gpt-4o", temperature=0).bind_tools(tools)
SYSTEM_PROMPT = """You are a research assistant with access to an internal
knowledge base. For each question:
1. Think about what information you need.
2. Search the knowledge base with specific, targeted queries.
3. Evaluate whether retrieved information answers the question.
4. If not, search again with a refined query.
5. Synthesize a comprehensive answer with source citations.
Never guess. Always ground your answers in retrieved evidence."""

def should_continue(state: MessagesState) -> str:
    """Route to the tool node if the model requested a tool call; otherwise stop."""
    last = state["messages"][-1]
    if last.tool_calls:
        return "tools"
    return END


def call_model(state: MessagesState) -> dict:
    """Prepend the system prompt and ask the model for its next step."""
    messages = [SystemMessage(content=SYSTEM_PROMPT)] + state["messages"]
    response = model.invoke(messages)
    return {"messages": [response]}

graph = StateGraph(MessagesState)
graph.add_node("agent", call_model)
graph.add_node("tools", ToolNode(tools))
graph.add_edge(START, "agent")
graph.add_conditional_edges("agent", should_continue, {"tools": "tools", END: END})
graph.add_edge("tools", "agent")
app = graph.compile()
result = app.invoke({
    "messages": [("user", "What embedding models work best for code search?")]
})
print(result["messages"][-1].content)
This implementation gives the agent full control over the retrieval loop. It can call the search tool multiple times with different queries, reason about intermediate results, and only generate a final response when it's satisfied with the evidence.
For a Node.js equivalent, LangGraph.js provides a prebuilt ReAct agent, createReactAgent, that implements the same loop:
import { ChatOpenAI } from "@langchain/openai";
import { createReactAgent } from "@langchain/langgraph/prebuilt";
import { tool } from "@langchain/core/tools";
import { z } from "zod";
import { FaissStore } from "@langchain/community/vectorstores/faiss";
import { OpenAIEmbeddings } from "@langchain/openai";
const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
const vectorStore = await FaissStore.load("knowledge_base", embeddings);

const searchKnowledgeBase = tool(
  async ({ query }) => {
    const docs = await vectorStore.similaritySearch(query, 4);
    return docs
      .map((d) => `[Source: ${d.metadata.source ?? "unknown"}]\n${d.pageContent}`)
      .join("\n\n");
  },
  {
    name: "search_knowledge_base",
    description: "Search the internal knowledge base for relevant information.",
    schema: z.object({ query: z.string().describe("The search query") }),
  }
);

const agent = createReactAgent({
  llm: new ChatOpenAI({ model: "gpt-4o", temperature: 0 }),
  tools: [searchKnowledgeBase],
});

const result = await agent.invoke({
  messages: [{ role: "user", content: "What embedding models work best for code search?" }],
});
// Print the final answer, mirroring the Python example.
console.log(result.messages[result.messages.length - 1].content);
Challenges in Agentic RAG
Agent Reliability
Agents can enter loops, choose wrong tools, or hallucinate reasoning steps. Mitigation strategies include:
- Max iteration limits — cap the agent loop at 5–10 iterations.
- Structured output schemas — force the agent to produce typed outputs at each step.
- Fallback pipelines — if the agent fails after N attempts, fall back to standard RAG.
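The iteration cap and the fallback can be combined in a thin wrapper around the agent. This is a sketch under stated assumptions: `AgentFailure`, `looping_agent`, and `standard_rag` are hypothetical stubs, and the agent is assumed to raise when it hits its cap. In LangGraph, the cap itself can be set through the `recursion_limit` key of the run config.

```python
class AgentFailure(Exception):
    """Raised by the (hypothetical) agent when it hits its iteration cap."""

def standard_rag(query: str) -> str:
    # Stub for a fixed retrieve-then-generate pipeline.
    return f"standard RAG answer for: {query}"

def looping_agent(query: str, max_iterations: int) -> str:
    # Stub agent that never converges, to exercise the fallback path.
    for _ in range(max_iterations):
        pass  # each pass stands in for one think/act cycle
    raise AgentFailure(f"no answer after {max_iterations} iterations")

def answer(query: str, agent, max_iterations: int = 8, max_attempts: int = 2) -> str:
    """Try the agent up to `max_attempts` times, then fall back to standard RAG."""
    for _ in range(max_attempts):
        try:
            return agent(query, max_iterations=max_iterations)
        except AgentFailure:
            continue
    return standard_rag(query)

print(answer("What is our refund policy?", looping_agent))
```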
Cost Management
Each agent loop iteration incurs LLM API costs. A single complex query might trigger 5–8 LLM calls plus multiple retrieval operations. Strategies to manage this:
- Use cheaper models for routing and tool selection, reserving capable models for synthesis.
- Cache tool outputs to avoid redundant retrievals.
- Set token budgets per query and gracefully degrade when budgets are reached.
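Two of these strategies, caching and budgets, fit in a few lines of standard-library Python. The sketch below is deliberately crude: `cached_search` is a stub, `TokenBudget` is a hypothetical helper, and splitting on whitespace is only a stand-in for a real tokenizer.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the underlying search actually runs

@lru_cache(maxsize=256)
def cached_search(query: str) -> str:
    # Stub retrieval; identical queries are served from the cache.
    CALLS["count"] += 1
    return f"[results for: {query}]"

class TokenBudget:
    """Per-query budget using a rough whitespace-based token estimate."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def charge(self, text: str) -> bool:
        """Record approximate usage; return False once the budget is exceeded."""
        self.used += len(text.split())
        return self.used <= self.limit

budget = TokenBudget(limit=5)
print(budget.charge("three short words"))      # within budget
print(budget.charge("three more words"))       # budget exceeded
```

When `charge` returns False, the agent can stop retrieving and answer from what it already has, which is one form of graceful degradation.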
Safety and Guardrails
Agentic systems amplify both the capabilities and risks of LLMs. If an agent has access to a code execution tool or API calls, a prompt injection in a retrieved document could trigger unintended actions. Essential guardrails include input validation on all tool calls, sandboxed execution environments, and human-in-the-loop approval for high-stakes actions.
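Input validation and approval gating can sit in one function that every tool call passes through before execution. The allowlist, the `send_email` tool, and the 2000-character limit below are all hypothetical choices made for illustration.

```python
# Hypothetical allowlist: tool name -> permitted argument names.
ALLOWED_TOOLS = {
    "search_knowledge_base": {"query"},
    "send_email": {"to", "body"},
}
# Tools whose side effects warrant human sign-off.
HIGH_STAKES = {"send_email"}
MAX_ARG_LENGTH = 2000

def validate_tool_call(name: str, args: dict, approved: bool = False) -> tuple[bool, str]:
    """Return (ok, reason) for a tool call proposed by the agent."""
    if name not in ALLOWED_TOOLS:
        return (False, "unknown tool")
    unexpected = set(args) - ALLOWED_TOOLS[name]
    if unexpected:
        return (False, f"unexpected arguments: {sorted(unexpected)}")
    if any(not isinstance(v, str) or len(v) > MAX_ARG_LENGTH for v in args.values()):
        return (False, "arguments must be strings within the length limit")
    if name in HIGH_STAKES and not approved:
        return (False, "needs human approval")
    return (True, "ok")

print(validate_tool_call("search_knowledge_base", {"query": "pricing"}))
print(validate_tool_call("send_email", {"to": "a@example.com", "body": "hi"}))
```

Checks like these do not detect prompt injection itself; they limit what an injected instruction can make the agent do.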
For a deeper dive into reliability and safety in production systems, see Building a Production-Ready RAG System.
When to Go Agentic
Not every RAG system needs agents. Use agentic RAG when:
- Questions regularly require multi-step reasoning or cross-referencing multiple sources.
- Users need the system to combine retrieval with computation, code execution, or API calls.
- Query complexity varies widely and a fixed pipeline can't adapt.
Stick with standard RAG when queries are straightforward factual lookups against a single knowledge source. The added complexity, cost, and latency of agents aren't justified for simple Q&A.
References
- Yao, S., et al. (2022). "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv:2210.03629.
- Wei, J., et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022.
- Shinn, N., et al. (2023). "Reflexion: Language Agents with Verbal Reinforcement Learning." NeurIPS 2023.
- LangGraph Documentation. "Build Stateful Agents." https://langchain-ai.github.io/langgraph/
- Weng, L. (2023). "LLM Powered Autonomous Agents." lilianweng.github.io.
- Schick, T., et al. (2023). "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv:2302.04761.
FAQ
How does agentic RAG differ from standard RAG with query rewriting?
Query rewriting is a single preprocessing step — the rewritten query still follows the fixed retrieve-then-generate pipeline. Agentic RAG makes retrieval iterative. The agent evaluates retrieved results and can issue multiple rounds of retrieval with different queries, switch between tools, and reason about when it has enough information. It's the difference between one optimized search versus a full research workflow.
What's the latency impact of adding agents to a RAG pipeline?
Each agent loop iteration adds at least one LLM inference call (typically 1–3 seconds for GPT-4o). A simple query might resolve in 1–2 iterations (comparable to standard RAG), while complex queries might take 5–8 iterations (roughly 5–25 seconds in total at that per-call latency, before tool execution time). Streaming the agent's intermediate reasoning to the user helps manage perceived latency. For latency-sensitive workloads, consider routing simple queries to standard RAG and only invoking the agent for complex ones.
Can I use open-source models for agentic RAG?
Yes, but model capability matters significantly. Tool-calling and structured reasoning require models that reliably follow complex instructions. Models like Llama 3.1 70B+ and Mixtral 8x22B handle basic ReAct patterns well. For multi-agent systems with complex planning, you'll currently get better reliability with frontier models. The gap is narrowing as open models improve at function calling and instruction following.
How do I prevent infinite loops in agent execution?
Set hard limits at multiple levels: maximum iterations per agent loop (typically 5–10), maximum total tool calls per query, and a global timeout. Additionally, implement loop detection by tracking the last N actions — if the agent repeats the same tool call with the same parameters, force it to either try a different approach or generate a response with what it has.
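The loop-detection idea can be sketched with a bounded history of recent calls. `LoopDetector` is a hypothetical helper, and this version assumes tool arguments are hashable values such as strings or numbers.

```python
from collections import deque

class LoopDetector:
    """Flags a tool call that exactly repeats one of the last `window` calls."""

    def __init__(self, window: int = 3):
        self.recent = deque(maxlen=window)

    def is_repeat(self, tool: str, args: dict) -> bool:
        # Sort the items so argument order does not affect the comparison.
        key = (tool, tuple(sorted(args.items())))
        repeated = key in self.recent
        self.recent.append(key)
        return repeated

detector = LoopDetector(window=3)
print(detector.is_repeat("search", {"query": "weaviate pricing"}))  # first time seen
print(detector.is_repeat("search", {"query": "weaviate pricing"}))  # exact repeat
```

When `is_repeat` returns True, the caller can inject a message telling the agent to change strategy, or end the loop and answer with the evidence gathered so far.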
When should I use multi-agent vs single-agent architectures?
Start with a single agent. Multi-agent systems add coordination overhead and debugging complexity. Graduate to multi-agent when you have clearly distinct domains that benefit from specialized prompts and tool sets (e.g., a code analysis agent and a documentation search agent), or when a single agent's tool list grows beyond 8–10 tools and it starts selecting poorly. For most RAG applications, a well-designed single agent with 3–5 focused tools outperforms a poorly orchestrated multi-agent system.