Chapter 1: Why LangGraph + MCP Is the Production Agent Stack of 2026
The agentic AI conversation in 2024 was about whether agents would work. By mid-2026, that question has been replaced. Production agents ship at every scale — from solo founder side-projects to Fortune 500 deployments — and the architecture has converged. The dominant stack is AI agents with LangGraph as the orchestration runtime, paired with the Model Context Protocol (MCP) as the tool delivery layer. This eguide is the working build manual for that stack, written for engineers shipping production agents in 2026.
The combination matters because each piece solves a problem the other did not. LangGraph is a stateful graph runtime: agents are nodes, state is explicit, transitions are documented, checkpointing is first-class, and human-in-the-loop pauses are a built-in primitive rather than a hack. MCP is Anthropic‘s open standard for serving tools over HTTP or stdio: a tool catalog runs as its own service, agents discover available tools dynamically, and the contract between agent and tool is versioned and discoverable.
Before this stack converged, every team building agents reinvented the same plumbing: how does the agent decide what to do next? How do we add a new tool without redeploying the agent? How do we resume a failed agent run? How do we let a human approve a step before the agent commits an irreversible action? LangGraph + MCP gives you opinionated answers to all of these.
The mental model
Think of a production agent system as three layers stacked on top of each other.
| Layer | Responsibility | Concrete components |
|---|---|---|
| Orchestration | What runs next, with what state, after what input | LangGraph nodes, edges, conditional routing, checkpointer |
| Capability | What the agent can do in the world | MCP servers exposing tools, resources, prompts |
| Reasoning | What the agent decides to do | LLM calls (Claude, GPT, Gemini, Llama) that the orchestration invokes |
The split is not arbitrary. Orchestration evolves on a different timescale than capability, which evolves on a different timescale than reasoning. New tools can land in the MCP layer without touching agent code. New models can be swapped at the reasoning layer without changing tools or graph structure. New routing logic can be added at the orchestration layer without affecting either. This separation of concerns is what makes the stack production-grade rather than a research demo.
Why other stacks lost
Several agent frameworks competed for the production slot in 2024-2026. The shape of the winners and losers tells you what the production market actually demands.
- Pure prompt-chaining frameworks (early LangChain patterns, simple ReAct loops) lost because they could not handle long-running, resumable workflows. Production agents need to survive process restarts, network failures, and human review pauses. Stateless prompt-chaining cannot.
- Heavyweight enterprise agent platforms (proprietary, vendor-controlled) lost on developer ergonomics and lock-in. Engineering teams want to swap models, swap deployment targets, and own their stack.
- Custom-built one-off systems lost because the maintenance cost compounds. Every team that built its own agent runtime has by now replaced it with LangGraph or migrated agents to a hosted equivalent.
- Tool-format-of-the-week approaches (function calling, OpenAI Plugins, custom JSON schemas) lost because tool definitions did not survive model changes. MCP wins because it is model-agnostic: the tool contract does not care whether Claude or GPT invokes it.
Who this guide is for
This eguide assumes you are an engineer or engineering leader responsible for shipping AI agents in 2026. You write code, you understand HTTP, you have shipped non-trivial Python or TypeScript services. You may be brand new to LangGraph or MCP — the chapters that follow start from first principles and build a complete production system.
By the end of Chapter 12, you will have a working multi-agent system using LangGraph + MCP, deployed in a production-ready pattern, with observability, authentication, rate limiting, and a clear picture of how to extend the system as your needs grow.
How this guide is organized
Chapters 2-3 lay foundations. Chapters 4-6 walk through building the agent system, layer by layer. Chapters 7-9 cover the production-grade concerns that separate a demo from a deployment. Chapters 10-12 cover deployment patterns, optimization, and the operational discipline that keeps agents alive in production.
If you are short on time, the highest-leverage chapters for a deployment decision are 5 (MCP integration), 7 (state and human-in-the-loop), and 10 (deployment patterns). Read those first. The rest fills in essential context.
The team profile that ships agents in 2026
Successful production agent teams share characteristics. The composition that consistently ships:
- One technical lead with strong async Python or TypeScript, deep familiarity with LLMs, and willingness to own the operational concerns.
- One or two implementation engineers who can build nodes, tools, and integrations under the lead’s guidance.
- One product or domain expert who knows what the agent should do and can write evaluation rubrics.
- Shared access to a designer for any user-facing interfaces and a security/compliance reviewer for high-stakes deployments.
The agent that ships fastest is rarely built by the largest team. Three to four people focused on the work outperform eight people coordinating. Add specialization (dedicated infrastructure, dedicated prompt engineering) only when scale demands it, typically after the first production deployment proves the value.
The economics that make agents worth building
Before committing to an agent build, validate the business case. Production agents are not free. The cost components include LLM tokens (typically $0.50-$5.00 per agent invocation depending on complexity), infrastructure (the API gateway, MCP servers, checkpointer storage), engineering time (a working production system is 4-8 engineer-weeks for a competent team), and ongoing operations (observability, on-call, prompt iteration).
The agents that justify this cost share characteristics: they replace high-volume work that would otherwise consume human time, they handle tasks where 80-95% completion accuracy is acceptable with human review for the rest, and they integrate with systems where the output drives concrete business actions (resolving tickets, writing reports, processing transactions). Agents that try to replace small-volume specialist work or that need 99%+ accuracy on every output rarely return their investment.
Run the math before you build. A customer-support agent that replaces $8 of human handling per ticket and runs 1,000 tickets per day produces $8,000 of daily savings. That justifies serious engineering investment. A research agent that runs 5 times per day for an analyst saves perhaps 30 minutes per run; the math may not justify the same investment unless the analyst’s time is genuinely scarce.
The model selection question
Picking the right LLM for each agent role is the single biggest performance and cost lever. The 2026 landscape is mature enough that some defaults work for most cases:
| Role | Recommended models | Why |
|---|---|---|
| Supervisor / router | Claude Haiku 4.5, GPT-4o-mini, Gemini Flash 2.5 | Fast, cheap, good at single-token decisions |
| Research specialist | Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro | Strong reasoning + tool use; good context handling |
| Code specialist | Claude Opus 4.7, GPT-5.5 | Best-in-class coding reasoning; deep tool integration |
| Writing specialist | Claude Opus 4.7, GPT-5.5 | Tone control, long-form coherence |
| Data analysis specialist | Claude Sonnet 4.6, GPT-5.4 | Strong at SQL, dataframe manipulation, structured output |
| Patient-facing chat | Claude Sonnet 4.6, GPT-5.4 with safety tuning | Warmth + careful refusals |
Multi-provider deployments — where different specialists call different providers — are common in production. The architectural caveat: build the agent layer with provider abstraction so model swaps are configuration changes, not code changes.
What this guide does not cover
To keep the scope tight, this eguide focuses on building production agents with LangGraph and MCP. It does not cover: training your own LLMs (use commercial APIs unless you have a specific reason not to), building your own MCP server protocol from scratch (use the existing SDKs), or fine-tuning models for agent use (rarely necessary in 2026 — frontier models tool-call well out of the box). It also does not cover non-agentic AI use cases like classification, embedding-based search, or single-shot generation; those use different patterns and rarely benefit from agent architectures.
Chapter 2: Core Concepts — Graphs, State, Nodes, and MCP Servers
Before you write code, you need a clear mental model of what LangGraph and MCP actually are. Skipping this chapter leads to confused debugging later.
LangGraph in three sentences
LangGraph is a Python (and TypeScript) library that lets you express an agent system as a directed graph. Each node in the graph is a function (typically an LLM call or a tool invocation) that reads from a shared state object and writes back updates. Edges connect nodes, can be conditional based on state, and define how the graph progresses through a single run.
The graph is not a one-shot pipeline. It is a stateful machine that can pause for human approval, resume from a checkpoint, branch based on LLM decisions, and persist its progress between process restarts. This is what makes it production-grade.
The State object
State is a TypedDict (or Pydantic model) that defines the shape of data flowing through the graph. Every node receives the current state and returns a partial update. The runtime merges updates back into state automatically.
from typing import TypedDict, Annotated
from langgraph.graph import StateGraph
from langchain_core.messages import BaseMessage
from operator import add
class AgentState(TypedDict):
messages: Annotated[list[BaseMessage], add] # appended each turn
next_agent: str # supervisor decision
research_findings: list[str] # accumulated as we go
code_artifacts: dict[str, str] # file_path -> contents
pending_approval: bool # human-in-the-loop flag
user_id: str # tenant scope
The Annotated type with a reducer (here, the add operator) tells LangGraph how to merge concurrent updates. Without that annotation, two nodes writing to the same field would clobber each other. With it, the runtime accumulates correctly.
Nodes and edges
Nodes are functions. Edges are connections between nodes. Both have specific properties:
- Nodes are pure functions of state. A node receives the current state and returns a partial state update. Side effects (HTTP calls, database writes) happen inside the node, but the node’s contract with the runtime is purely state-in, state-out.
- Edges can be unconditional or conditional. Unconditional edges always go from A to B. Conditional edges read state and route to one of multiple destinations. Conditional routing is how the supervisor pattern works.
- Special nodes: START and END. Every graph has a START node where execution begins and one or more END nodes that terminate the run. Multiple END states are allowed; one for success, one for failure, one for human-review-required.
MCP in three sentences
MCP — Model Context Protocol — is an open standard published by Anthropic in 2024 and adopted broadly through 2025-2026. An MCP server is a process (local or remote) that exposes a catalog of tools, resources, and prompts over HTTP, SSE, or stdio. An MCP client (your agent) connects to one or more MCP servers, discovers their catalogs, and calls them as needed.
The reason MCP matters for production: tools are no longer hardcoded into agent code. A new tool comes online when an MCP server is updated; it becomes available to the agent on the next discovery refresh. Tool versioning, deprecation, and tenant-specific tool sets all become operational tasks instead of code changes.
The MCP transport choice
MCP supports three transports: stdio (subprocess), HTTP, and Server-Sent Events (SSE) over HTTP. Each fits a different deployment shape:
| Transport | Best for | Trade-offs |
|---|---|---|
| stdio | Local development, single-machine deployments | No network, no concurrent multi-agent; fast iteration |
| HTTP | Production, distributed, multi-agent | Standard web infrastructure; most flexible |
| SSE | Streaming long-running tools | Required for tools that emit progress; modest complexity |
Production deployments overwhelmingly use HTTP. Reach for stdio only during local development; reach for SSE only when a tool genuinely needs streaming progress events.
State as a contract
The AgentState schema is a contract. Once nodes are written against a state shape, changing it is a breaking change. Treat the state schema with the same rigor as a public API. Specifically:
- Version it. Even if you start with one shape, pre-plan the path to v2.
- Document each field. What it represents, who reads it, who writes it.
- Avoid premature unions. A field that means different things at different times is a bug factory.
- Keep it small. Every field that flows through every node is overhead. Move per-step ephemeral data to local variables instead.
The reducer functions that matter
Annotated state fields with reducers are how concurrent updates merge. A few reducer patterns worth knowing:
- add (operator.add). Concatenates lists or sums numbers. Most common for messages and accumulating findings.
- last_value. Custom reducer that keeps the most recent write. Useful for status fields where only the latest matters.
- merge_dict. Custom reducer that deep-merges dictionaries. Useful for fields like code_artifacts where multiple agents add files.
- append_unique. Custom reducer that appends only items not already present. Useful for tags, references, or de-duplicated findings.
from typing import TypedDict, Annotated
def merge_dict(left: dict, right: dict) -> dict:
return {**left, **right}
def append_unique(left: list, right: list) -> list:
seen = set(left)
return left + [x for x in right if x not in seen and not seen.add(x)]
class State(TypedDict):
code_artifacts: Annotated[dict[str, str], merge_dict]
references: Annotated[list[str], append_unique]
The MCP capability surface
An MCP server exposes three categories of capability:
- Tools. Functions the agent can call. Each tool has a name, description, and JSON schema for parameters. This is the most-used capability and what most production servers focus on.
- Resources. Read-only data the agent can fetch. Useful for agent context that should not be tool-call-shaped: configuration, documentation, recent activity logs.
- Prompts. Reusable prompt templates that the server publishes. Enables a “prompt library” pattern where common reasoning patterns get standardized server-side.
Most production MCP servers use only tools. Resources and prompts become useful as the system grows; both are good extensions when you need them but not required for a working v1.
Why the protocol abstraction matters
Beyond the technical mechanics, MCP solves an organizational problem. In pre-MCP agent systems, the team that builds the agent and the team that builds the integrations had to coordinate every change. Adding a new internal tool meant a release of the agent service. Versioning a tool meant agent rollbacks if compatibility broke.
With MCP, the integration team owns its servers. The agent team consumes whatever the integration team publishes. The two teams ship independently. Tool deprecation, versioning, and migration become the integration team’s operational concerns, not the agent team’s deployment dependencies. This separation of teams is what makes the architecture scale to large organizations.
Chapter 3: Setting Up a Development Environment
The fastest way to lose hours on agent development is to fight your environment. This chapter walks through a clean setup that works on macOS, Linux, and Windows (via WSL2 for the smoothest experience).
Python and dependency management
LangGraph requires Python 3.11 or newer. The agent ecosystem moves fast; use a recent Python and a real package manager.
# Create a project with uv (recommended for speed and reproducibility)
uv init agent-system
cd agent-system
# Pin Python and add dependencies
uv python install 3.12
echo "3.12" > .python-version
uv add langgraph langchain langchain-anthropic langchain-openai \
langchain-mcp-adapters mcp \
fastapi uvicorn python-dotenv pydantic redis \
langsmith opentelemetry-api opentelemetry-sdk
uv add --dev pytest pytest-asyncio ruff mypy
Environment variables
Production agents need API keys, model endpoints, and observability credentials. Manage them through environment variables loaded from a .env file in development and from a secrets manager in production.
# .env (development only — never commit)
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
# LangSmith for tracing (optional but strongly recommended)
LANGSMITH_API_KEY=lsv2_...
LANGSMITH_PROJECT=agent-system-dev
LANGSMITH_TRACING=true
# MCP server endpoints
MCP_RESEARCH_URL=http://localhost:8001/mcp
MCP_CODE_URL=http://localhost:8002/mcp
# Agent runtime config
AGENT_MAX_ITERATIONS=20
AGENT_TIMEOUT_SECONDS=180
CHECKPOINT_REDIS_URL=redis://localhost:6379/0
Local infrastructure
For development, you need a Redis (for checkpointing), a PostgreSQL (for production-grade checkpoints), and your MCP servers running. Use docker-compose:
# docker-compose.yml
services:
redis:
image: redis:7-alpine
ports: ["6379:6379"]
volumes: ["redis-data:/data"]
postgres:
image: postgres:16-alpine
environment:
POSTGRES_PASSWORD: dev
POSTGRES_DB: agents
ports: ["5432:5432"]
volumes: ["postgres-data:/var/lib/postgresql/data"]
mcp-research:
build: ./mcp-servers/research
ports: ["8001:8001"]
environment:
- SERPAPI_KEY=${SERPAPI_KEY}
mcp-code:
build: ./mcp-servers/code
ports: ["8002:8002"]
volumes:
- ./code-sandbox:/sandbox
volumes:
redis-data:
postgres-data:
Project layout
The repository structure that scales from POC to production:
agent-system/
├── pyproject.toml
├── .env
├── docker-compose.yml
├── src/
│ ├── agents/
│ │ ├── __init__.py
│ │ ├── state.py # AgentState definition
│ │ ├── supervisor.py # supervisor node
│ │ ├── research.py # research agent node
│ │ ├── code.py # code agent node
│ │ └── graph.py # graph assembly
│ ├── mcp/
│ │ ├── client.py # MCP client setup
│ │ └── tools.py # tool helpers
│ ├── observability/
│ │ ├── tracing.py
│ │ └── metrics.py
│ ├── api/
│ │ ├── server.py # FastAPI gateway
│ │ └── auth.py
│ └── config.py
├── mcp-servers/
│ ├── research/
│ └── code/
├── tests/
└── docs/
This layout separates the agent code, the MCP servers, the API gateway, and observability concerns. As the system grows, each piece can be deployed independently.
Secrets management
Production secret management is non-negotiable. Three patterns that work:
- Cloud secret manager. AWS Secrets Manager, GCP Secret Manager, Azure Key Vault. The agent service reads secrets at startup and refreshes on rotation. Most operationally clean.
- Kubernetes secrets with sealed-secrets. If you run on K8s, sealed-secrets keeps encrypted secrets in version control with the cluster decrypting at runtime.
- HashiCorp Vault. For organizations with strong audit requirements. More operational overhead but the strongest controls.
Avoid: secrets in environment variables baked into container images, secrets in CI/CD logs, and secrets in commit history. Each has produced production breaches in the past two years.
Editor and IDE setup
Agent development is iterative. The faster your edit-test loop, the faster you ship. Recommended editor configuration:
- Type checking. Pyright or mypy in strict mode. Agent state types catch most agent bugs before runtime.
- Auto-formatting. Ruff for both linting and formatting. Run on save.
- LangGraph studio (optional). Visual graph debugger that shows state transitions. Useful when graph structure gets complex.
- HTTP client. An HTTPie, curl-with-aliases, or Bruno setup for poking at MCP servers and the API gateway during development.
Configuration management
Beyond secrets, agents need configuration: model names, prompt templates, tool endpoints, iteration limits, timeout values. Three patterns for managing this:
- Environment variables. Simple, works for small numbers of values, less manageable for prompt templates.
- YAML config files. Version-controlled config that the agent reads at startup. Good for prompt templates and structured config.
- Feature flag service. Dynamic configuration that changes without redeploy. Required for runtime adjustment of model selection, iteration limits, and rollback decisions.
The pattern that scales: feature flag service for dynamic values, YAML files for prompt templates and tool definitions, environment variables for static infrastructure config. Mixing all three is normal in production.
Test infrastructure
Agent code is testable, despite the LLM in the loop. Two test patterns dominate:
- Unit tests for nodes with mocked LLMs. Pass synthetic state in, assert on the state update returned. Mock the LLM to return canned responses for specific inputs.
- Integration tests against real LLMs with deterministic seeds. Set temperature=0 and run a known input through the full graph. Snapshot test the output.
# tests/test_supervisor.py
import pytest
from unittest.mock import patch
from src.agents.supervisor import supervisor_node
def test_supervisor_routes_to_research_for_factual_query():
with patch("src.agents.supervisor.supervisor_llm") as mock_llm:
mock_llm.invoke.return_value.content = "research"
state = {"messages": [
{"role": "user", "content": "What's the population of Tokyo?"}
], "next_agent": "", "research_findings": [], "code_artifacts": {},
"pending_approval": False, "user_id": "test"}
result = supervisor_node(state)
assert result["next_agent"] == "research"
def test_supervisor_finishes_when_question_answered():
with patch("src.agents.supervisor.supervisor_llm") as mock_llm:
mock_llm.invoke.return_value.content = "FINISH"
state = {"messages": [
{"role": "user", "content": "Hi"},
{"role": "assistant", "content": "Hello! How can I help?"}
], "next_agent": "", "research_findings": [], "code_artifacts": {},
"pending_approval": False, "user_id": "test"}
result = supervisor_node(state)
assert result["next_agent"] == "FINISH"
Verifying the setup
Before writing the first agent, verify the stack works end-to-end:
# sanity_check.py
import os
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-opus-4-7", api_key=os.environ["ANTHROPIC_API_KEY"])
print(llm.invoke("Reply with: hello").content)
# Should print "hello" or close to it.
# If you see an auth error, your API key is wrong.
# If you see a network error, check connectivity.
# If you see a model name error, verify available models for your account.
Chapter 4: Building Your First Single-Agent ReAct Loop
Before tackling multi-agent supervision, build a single agent. The ReAct (Reasoning + Acting) pattern is the foundation: the agent loops between thinking, deciding which tool to call, calling the tool, observing the result, and deciding what to do next. LangGraph makes this loop explicit and resumable.
The minimum ReAct graph
A working single-agent ReAct graph has two nodes and a conditional edge:
from typing import Literal
from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import ToolNode
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, AIMessage, ToolMessage
from src.agents.state import AgentState
# Define the tools the agent can use (we'll wire MCP in chapter 5)
@tool
def web_search(query: str) -> str:
"""Search the web and return top results."""
# Implementation omitted; replace with your search backend
return f"Mock results for: {query}"
@tool
def fetch_page(url: str) -> str:
"""Fetch a URL and return its text content."""
return f"Mock page content from: {url}"
tools = [web_search, fetch_page]
tool_node = ToolNode(tools)
llm = ChatAnthropic(model="claude-opus-4-7").bind_tools(tools)
def agent_node(state: AgentState) -> dict:
response = llm.invoke(state["messages"])
return {"messages": [response]}
def should_continue(state: AgentState) -> Literal["tools", "__end__"]:
last = state["messages"][-1]
if isinstance(last, AIMessage) and last.tool_calls:
return "tools"
return "__end__"
graph = StateGraph(AgentState)
graph.add_node("agent", agent_node)
graph.add_node("tools", tool_node)
graph.add_edge(START, "agent")
graph.add_conditional_edges("agent", should_continue, {"tools": "tools", "__end__": END})
graph.add_edge("tools", "agent")
app = graph.compile()
Running the graph
Once compiled, the graph is invocable. The runtime handles the loop between agent and tools automatically.
result = app.invoke({
"messages": [HumanMessage(content="What's the weather in San Francisco today?")],
"next_agent": "",
"research_findings": [],
"code_artifacts": {},
"pending_approval": False,
"user_id": "demo-user-123",
})
for msg in result["messages"]:
print(f"[{msg.type}] {msg.content[:200]}")
What the runtime is doing
Behind the scenes, LangGraph executes:
- START → agent: invoke the LLM with current messages
- The LLM decides whether to call a tool
- If yes: should_continue routes to “tools”, ToolNode invokes the tool, appends a ToolMessage to messages
- tools → agent: loop back to step 1
- If no: should_continue routes to END
The loop terminates when the LLM produces a response without tool calls. Without explicit iteration limits, a buggy agent could loop indefinitely; we add limits in Chapter 11.
Streaming the response
Production UIs need streaming. LangGraph supports streaming both intermediate node outputs and individual LLM tokens:
async for event in app.astream(inputs, stream_mode="values"):
last_message = event["messages"][-1]
if isinstance(last_message, AIMessage):
print(last_message.content)
# For token-level streaming inside a node, use stream_mode="messages"
async for event in app.astream(inputs, stream_mode="messages"):
chunk, metadata = event
if hasattr(chunk, "content") and chunk.content:
print(chunk.content, end="", flush=True)
Limitations of single-agent
The single-agent pattern works for narrow tasks. As the task complexity grows, you hit limits:
- Long context fills with tool calls and results, eventually crowding out reasoning
- One agent doing everything cannot specialize on different tool subsets
- Errors compound; one bad tool call cascades
- Reasoning quality degrades with too many tools in the prompt
The fix is multi-agent supervision, which we build in Chapter 6 after wiring MCP in Chapter 5.
The system prompt for a single agent
The system prompt shapes how the agent behaves. For a single-agent ReAct loop, a working pattern:
SYSTEM_PROMPT = """You are an AI assistant helping a user accomplish their task.
You have access to tools. Use them when you need information or to take actions.
Guidelines:
- Think step by step before calling a tool. Explain what you're about to do.
- Call only one tool per turn unless the situation requires parallelism.
- After receiving tool results, decide whether to call another tool or respond to the user.
- If a tool fails, try to recover (different parameters, different tool, or admit defeat gracefully).
- When you have enough information to answer the user, respond directly.
- Be concise. Users hate verbose AI responses.
Available tools will be provided automatically."""
# Inject as the first message
from langchain_core.messages import SystemMessage
agent_node_state = state.copy()
agent_node_state["messages"] = [SystemMessage(content=SYSTEM_PROMPT)] + state["messages"]
Common ReAct loop bugs
Single-agent ReAct loops have a small set of failure modes that come up repeatedly:
- Infinite loop without tool calls. The LLM keeps responding without calling tools, never finishing. Fix: add iteration cap.
- Hallucinated tool names. The LLM calls a tool that doesn’t exist. Fix: validate tool names before invocation; respond with a helpful error if invalid.
- Bad tool parameters. The LLM passes wrong types or missing required fields. Fix: schema validation before invocation; structured error responses that the LLM can recover from.
- Tool result not addressed. The LLM ignores tool results and tries the same query again. Fix: clearer system prompt; sometimes a smaller, more focused model works better here.
Chapter 5: Adding MCP-Served Tools to the Agent
Hardcoded tools work in development but break the operational benefits of the stack. In production, tools are served by MCP servers — separate processes that publish a catalog and accept invocation calls. This chapter wires that in.
The langchain-mcp-adapters bridge
The langchain-mcp-adapters package converts MCP tool manifests into LangChain tools that work natively with LangGraph. The bridge handles tool discovery, schema translation, and call routing.
from langchain_mcp_adapters.client import MultiServerMCPClient
client = MultiServerMCPClient({
"research": {
"url": "http://localhost:8001/mcp",
"transport": "streamable_http",
},
"code": {
"url": "http://localhost:8002/mcp",
"transport": "streamable_http",
},
})
# Open the connection (reuse for the process lifetime in production)
async with client:
tools = client.get_tools()
print(f"Discovered {len(tools)} tools across MCP servers")
for t in tools:
print(f" - {t.name}: {t.description[:60]}")
Building a minimal MCP server
For the development environment, you need at least one MCP server running. The Python MCP SDK provides a minimal server template:
# mcp-servers/research/server.py
from mcp.server.fastmcp import FastMCP
import httpx
import os
mcp = FastMCP("research-mcp")
@mcp.tool()
async def web_search(query: str) -> str:
"""Search the web and return top results."""
api_key = os.environ["SERPAPI_KEY"]
async with httpx.AsyncClient(timeout=15) as http:
r = await http.get("https://serpapi.com/search",
params={"q": query, "api_key": api_key, "num": 5})
data = r.json()
results = data.get("organic_results", [])
return "\n".join(
f"- {x.get('title')}: {x.get('snippet')} ({x.get('link')})"
for x in results
)
@mcp.tool()
async def fetch_page(url: str) -> str:
"""Fetch a URL and return its text content."""
async with httpx.AsyncClient(timeout=15, follow_redirects=True) as http:
r = await http.get(url)
return r.text[:8000] # cap to avoid context bloat
if __name__ == "__main__":
mcp.run(transport="streamable-http", host="0.0.0.0", port=8001)
Wiring MCP tools into the agent
Replace the hardcoded tools from Chapter 4 with MCP-discovered tools:
from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.prebuilt import create_react_agent
from langchain_anthropic import ChatAnthropic
async def build_agent():
client = MultiServerMCPClient({
"research": {"url": os.environ["MCP_RESEARCH_URL"], "transport": "streamable_http"},
"code": {"url": os.environ["MCP_CODE_URL"], "transport": "streamable_http"},
})
# Hold the connection open for the process lifetime
await client.__aenter__()
tools = client.get_tools()
llm = ChatAnthropic(model="claude-opus-4-7")
agent = create_react_agent(llm, tools)
return agent, client
# In your application startup
agent, mcp_client = await build_agent()
# In your application shutdown (FastAPI lifespan, etc.)
await mcp_client.__aexit__(None, None, None)
Why hold the client open
The single most common production bug with MCP is opening and closing the client per-request. This pattern works but burns latency on every connection setup and overwhelms MCP servers under load. Open the client once at process startup, reuse it across all agent invocations, close it on shutdown.
Tool versioning and discovery
Production MCP deployments need clear versioning. Two patterns:
- Server-level version pinning. Each MCP server endpoint includes the version in the URL:
/mcp/v1,/mcp/v2. Old agents continue working against old endpoints; new agents use new endpoints. - Tool-level versioning. Tools include a version field. The MCP server can serve multiple versions of the same tool with deprecation timestamps.
Combine both for maximum flexibility. The discovery refresh interval — how often the agent re-fetches the tool catalog — defaults to once at startup but can be configured to refresh on a schedule for long-running processes.
Authentication for MCP servers
Production MCP servers need authentication. The MCP spec is intentionally agnostic about authentication mechanism; the most common patterns:
- API key in header. Simple. Each agent process holds an API key for each MCP server it uses. Rotate periodically.
- mTLS. Mutual TLS between agent service and MCP servers. Stronger guarantees, more operational overhead. Right for sensitive tools (financial transactions, infrastructure changes).
- Service-mesh identity. If you run on Kubernetes with Istio or similar, the service mesh handles identity automatically. Often the easiest path for K8s-heavy organizations.
client = MultiServerMCPClient({
"research": {
"url": os.environ["MCP_RESEARCH_URL"],
"transport": "streamable_http",
"headers": {
"Authorization": f"Bearer {os.environ['MCP_RESEARCH_TOKEN']}",
},
},
# ... other servers
})
Tool input validation
The LLM occasionally produces tool calls with wrong-shaped arguments. The MCP server should validate inputs and return informative errors. Error messages get fed back to the LLM, which can correct on the next turn.
from pydantic import BaseModel, Field, ValidationError
from mcp.server.fastmcp import FastMCP
mcp = FastMCP("research-mcp")
class WebSearchArgs(BaseModel):
query: str = Field(..., min_length=1, max_length=500)
num_results: int = Field(5, ge=1, le=20)
site_filter: str | None = Field(None, max_length=100)
@mcp.tool()
async def web_search(query: str, num_results: int = 5,
site_filter: str | None = None) -> str:
"""Search the web. Returns top results with snippets and URLs."""
try:
args = WebSearchArgs(query=query, num_results=num_results,
site_filter=site_filter)
except ValidationError as e:
# Return error string the LLM can read and correct
return f"INVALID ARGUMENTS: {e.errors()[0]['msg']}"
# Proceed with the validated args
results = await search_backend(args.query, args.num_results, args.site_filter)
return format_results(results)
Tool result formatting
How tool results are formatted shapes how well the LLM uses them. Patterns that work:
- Structured but compact. Use clear delimiters (newlines, dashes) but don’t pad with markdown formatting. Tokens cost money.
- Consistent shape. The same tool always returns results in the same shape. Variable shapes confuse the LLM.
- Truncation with notice. If you truncate, say so explicitly: “[truncated, 1240 more characters omitted]”. The LLM can ask for more if needed.
- Error indication. Errors return strings starting with “ERROR:” that the LLM can pattern-match on.
The mock-MCP pattern for testing
Production MCP integrations should be tested without hitting real services. A mock MCP server that responds with canned data simplifies CI:
# tests/conftest.py
import pytest
from mcp.server.fastmcp import FastMCP
@pytest.fixture
def mock_research_server(monkeypatch, tmp_path):
mock_mcp = FastMCP("mock-research")
@mock_mcp.tool()
def web_search(query: str) -> str:
return f"MOCK RESULTS for: {query}"
@mock_mcp.tool()
def fetch_page(url: str) -> str:
return f"MOCK PAGE CONTENT from: {url}"
# Run in background thread, return URL for client config
server_url = "http://localhost:18001/mcp"
# ... start server thread, return URL
yield server_url
# ... cleanup
Chapter 6: The Multi-Agent Supervisor Pattern
Single-agent ReAct hits limits as task complexity grows. The multi-agent supervisor pattern solves it by introducing specialization: a supervisor LLM decides which specialized agent should handle each turn, and the specialist agents handle their domain with focused tool sets. This is the canonical production agent topology in 2026.
The graph structure
A supervisor multi-agent graph has three nodes plus the conditional routing logic:
from langgraph.graph import StateGraph, START, END
from langchain_core.messages import AIMessage
from src.agents.state import AgentState
graph = StateGraph(AgentState)
graph.add_node("supervisor", supervisor_node)
graph.add_node("research_agent", research_node)
graph.add_node("code_agent", code_node)
graph.add_edge(START, "supervisor")
graph.add_conditional_edges(
"supervisor",
lambda s: s["next_agent"], # supervisor writes "next_agent" into state
{
"research": "research_agent",
"code": "code_agent",
"FINISH": END,
},
)
graph.add_edge("research_agent", "supervisor")
graph.add_edge("code_agent", "supervisor")
app = graph.compile()
The supervisor node
The supervisor is a small, focused LLM call. It reads the conversation, decides which specialist should act next, and writes the decision into state. It does not do any tool calling itself.
SUPERVISOR_SYSTEM = """You are a workflow supervisor. Your job is to decide which
specialist should handle the next turn given the conversation so far.
Available specialists:
- research: handles web search, fact-finding, and information retrieval
- code: handles writing, running, and testing code
Rules:
- If the user's request is fully answered, respond exactly: FINISH
- If the request needs information from the web, route: research
- If the request needs code written or executed, route: code
- Output ONLY one word: research, code, or FINISH"""
from langchain_anthropic import ChatAnthropic
supervisor_llm = ChatAnthropic(model="claude-haiku-4-5", temperature=0)
def supervisor_node(state: AgentState) -> dict:
response = supervisor_llm.invoke([
{"role": "system", "content": SUPERVISOR_SYSTEM},
*state["messages"],
])
decision = response.content.strip()
if decision not in {"research", "code", "FINISH"}:
decision = "FINISH" # fail-safe
return {"next_agent": decision}
The specialist nodes
Each specialist is itself a ReAct agent with a focused tool set, built on top of the MCP-discovered tools from Chapter 5:
research_tools = [t for t in mcp_tools if t.metadata.get("server") == "research"]
code_tools = [t for t in mcp_tools if t.metadata.get("server") == "code"]
research_agent = create_react_agent(llm, research_tools, name="research")
code_agent = create_react_agent(llm, code_tools, name="code")
async def research_node(state: AgentState) -> dict:
result = await research_agent.ainvoke({"messages": state["messages"]})
return {"messages": result["messages"][-1:]} # only the last AI msg
async def code_node(state: AgentState) -> dict:
result = await code_agent.ainvoke({"messages": state["messages"]})
return {"messages": result["messages"][-1:]}
Why this pattern wins
The supervisor pattern produces three operational benefits over single-agent ReAct:
- Tool isolation. Each specialist sees only its own tools, keeping the prompt focused and reducing tool-selection errors.
- Routing is auditable. The supervisor’s decision trace is explicit. When something goes wrong, you can see exactly which specialist was invoked and why.
- Specialists evolve independently. The research agent and code agent can be improved or swapped without affecting each other.
When to add more specialists
Two specialists is the minimum useful supervisor pattern. The natural growth path:
- Add a data agent when your workload includes structured data analysis (SQL, dataframe operations)
- Add a communication agent when your workload involves emailing, messaging, or notifying external parties
- Add a verification agent when output quality matters and you want a separate evaluator
Resist over-specialization. Five or six specialist agents is typically the sweet spot. Beyond that, the supervisor’s routing decisions become harder, and the operational complexity grows faster than the capability gain.
The supervisor system prompt patterns that work
The supervisor’s system prompt determines whether routing decisions are crisp or sloppy. Patterns that consistently produce good routing:
- Explicit specialist descriptions. Each specialist gets a 1-2 sentence description of what it does and when to use it. Vague descriptions produce vague routing.
- Single-token output. The supervisor outputs the specialist name as a single word. No commentary, no explanation. Saves tokens and produces deterministic routing.
- Examples in the prompt. 3-5 example user inputs paired with the correct routing decision. Few-shot examples produce 30-50% better routing than zero-shot prompts on the same model.
- Fallback to FINISH. When in doubt, the supervisor returns FINISH and the user gets the current state of the conversation. Loops happen when the supervisor cannot decide.
The router-vs-orchestrator distinction
Two patterns exist for the top-level node in a multi-agent graph. The router pattern (what we built) makes a routing decision and delegates to one specialist at a time. The orchestrator pattern is more aggressive: the top node plans a multi-step workflow upfront and dispatches specialists in sequence or parallel, then composes their results.
| Pattern | Strengths | Weaknesses |
|---|---|---|
| Router (supervisor) | Simple to reason about; specialists evolve independently; debug-friendly | Sequential; one specialist at a time |
| Orchestrator (planner) | Parallelizable; can produce richer outputs; faster wall-clock for parallelizable tasks | Harder to reason about; planning errors cascade; harder to debug |
Most production agents start as router patterns and add orchestrator capabilities for specific workflows where parallelism wins. Beginning with the orchestrator pattern is rarely worth the upfront complexity.
The supervisor failure modes
Supervisors fail in characteristic ways. Knowing the patterns saves debugging time:
- Premature FINISH. Supervisor decides the conversation is done before the user’s request is actually answered. Fix: stronger system prompt examples showing partial-completion cases.
- Ping-pong routing. Supervisor alternates between two specialists indefinitely. Fix: add a turn counter and force FINISH after a configured limit; investigate why specialists aren’t producing complete answers.
- Wrong specialist. Routing to the code agent for a research question. Fix: improve descriptions of each specialist’s domain; add few-shot examples to the supervisor prompt.
- No FINISH ever. Supervisor keeps routing without ever finishing. Fix: stronger FINISH instruction; ensure the system prompt includes “if the user’s question is answered, output FINISH” as the first decision rule.
The hand-off pattern
Multi-agent systems sometimes need an explicit hand-off where one specialist signals that another should take over with specific context. The pattern:
# A specialist returns a "handoff" message that the supervisor sees
class HandoffRequest(TypedDict):
target: str
reason: str
context: dict
# In a specialist node, signal a handoff
async def code_node(state):
result = await code_agent.ainvoke({"messages": state["messages"]})
last = result["messages"][-1]
if "I need research help" in last.content:
return {
"messages": [last],
"next_agent": "research", # bypass supervisor decision
"handoff_context": {"need": "documentation lookup"},
}
return {"messages": [last]}
Chapter 7: Persistent State, Checkpoints, and Human-in-the-Loop
Production agents fail mid-run. The process restarts, the network hiccups, the user closes their browser, the LLM times out. Without persistent state, every failure starts the agent over. With persistent state — and LangGraph’s checkpointer — agents resume from where they left off.
The checkpointer
A checkpointer is a backend that persists graph state. LangGraph provides several:
| Checkpointer | Backend | Best for |
|---|---|---|
| MemorySaver | In-memory dict | Tests, demos |
| SqliteSaver | SQLite file | Single-process production, low scale |
| PostgresSaver | PostgreSQL | Multi-process production at any scale |
| RedisSaver | Redis | Latency-sensitive workloads |
Wiring a checkpointer
from langgraph.checkpoint.postgres import PostgresSaver
import os
with PostgresSaver.from_conn_string(os.environ["POSTGRES_URL"]) as checkpointer:
checkpointer.setup() # creates tables on first run
app = graph.compile(checkpointer=checkpointer)
# Now invocations require a thread_id
config = {"configurable": {"thread_id": "user-123-session-456"}}
result = app.invoke(inputs, config=config)
# A second invocation with the same thread_id resumes
follow_up = app.invoke(
{"messages": [HumanMessage(content="Can you elaborate?")]},
config=config,
)
Human-in-the-loop pauses
The killer feature of stateful agents: pausing for human approval. LangGraph supports this through the interrupt_before and interrupt_after options on graph compilation:
# Pause before the code_agent ever runs (so a human approves
# before any code is executed)
app = graph.compile(
checkpointer=checkpointer,
interrupt_before=["code_agent"],
)
# First invocation — runs until the interrupt
state = app.invoke(inputs, config)
# state["__interrupt__"] is set; persistence layer holds the run
# Human reviews state, approves
# Second invocation with config alone (no inputs) resumes from interrupt
final = app.invoke(None, config)
The approval UI pattern
Production approval interfaces typically follow this pattern: the agent runs to the interrupt point, the system surfaces the pending action to a human (web UI, email, Slack), the human approves or rejects, the agent resumes or aborts. Implementing this end-to-end:
# server.py — FastAPI handlers
from fastapi import FastAPI, HTTPException
api = FastAPI()
@api.post("/agents/start")
async def start_agent(request: StartRequest):
config = {"configurable": {"thread_id": request.thread_id}}
state = await app.ainvoke({"messages": request.messages}, config=config)
if "__interrupt__" in state:
# Notify human reviewers
await notify_reviewer(thread_id=request.thread_id,
pending=state.get("pending_action"))
return {"status": "awaiting_approval", "thread_id": request.thread_id}
return {"status": "complete", "result": state}
@api.post("/agents/approve/{thread_id}")
async def approve(thread_id: str):
config = {"configurable": {"thread_id": thread_id}}
state = await app.ainvoke(None, config=config)
return {"status": "complete", "result": state}
@api.post("/agents/reject/{thread_id}")
async def reject(thread_id: str, reason: ResetRequest):
# Update state with rejection reason and route around the action
config = {"configurable": {"thread_id": thread_id}}
await app.aupdate_state(config, {"messages": [
SystemMessage(content=f"Action rejected: {reason.reason}")
]}, as_node="code_agent")
state = await app.ainvoke(None, config=config)
return {"status": "complete_with_rejection", "result": state}
What to interrupt before
Not every action needs human approval. Interrupt before actions with these properties:
- Irreversible (publishing, sending emails, financial transactions)
- High-blast-radius (mass updates, deletions)
- Regulatory or compliance-sensitive
- External-facing (touching production systems beyond the agent’s sandbox)
Avoid interrupting before reversible exploratory actions. The friction of constant approval requests destroys the agent’s value.
Time-travel debugging
The checkpointer keeps the full history of state across the run. This enables time-travel debugging: rewind the agent to a previous step, change the input, and replay forward. Useful when investigating why an agent made a particular decision:
# Get the full history of a thread
config = {"configurable": {"thread_id": "user-123-session-456"}}
history = list(app.get_state_history(config))
for state in history:
print(state.config["configurable"]["checkpoint_id"], state.values["messages"][-1])
# Replay from a specific checkpoint
target = history[3] # rewind to 4th step from end
new_config = {"configurable": {**target.config["configurable"]}}
# Update state with corrected input
app.update_state(new_config, {"messages": [HumanMessage(content="Different question")]})
# Run forward from that point
result = app.invoke(None, new_config)
The right cadence for human review
Not every action needs human review, and not every review pattern is the same. The cadence patterns that work in production:
- Per-action review. Every irreversible action waits for explicit approval. Right for high-stakes domains (finance, healthcare, regulated industries) and early in deployment when trust is being built.
- Sampling review. A random or rule-based sample of actions are surfaced for human review post-action. Right for high-volume workflows where per-action review is unworkable.
- Threshold review. Only actions above a defined threshold (dollar value, blast radius, complexity) require approval. Right for mature deployments with measured trust.
- Audit-only. Actions execute autonomously; humans review the audit log periodically. Right for fully autonomous workflows after long periods of demonstrated reliability.
Most production deployments evolve through these patterns over the first year — starting per-action, moving to threshold, and arriving at audit-only for the lower-stakes portions of the workload while keeping per-action review for high-stakes paths.
State migration on schema changes
When you change the AgentState schema (add a new field, rename a field, change a type), existing checkpoints have the old shape. Without a migration story, in-flight conversations break.
Strategies that work:
- Additive changes only. Add new fields with default values. Old checkpoints work because the old fields are still there and new fields default.
- Migration script. When a breaking change is necessary, write a migration that walks all active checkpoints and rewrites them to the new schema. Run during a maintenance window.
- Versioned state. Include a schema version field in state. Nodes check the version and migrate at the entry point. More flexible, more code.
Chapter 8: Observability with LangSmith and OpenTelemetry
You cannot operate what you cannot see. Production agents fail in ways that are unique to AI systems: the model returns unexpected outputs, the tool calls cascade in surprising orders, the reasoning loops hit edge cases that no test caught. Observability is not optional.
LangSmith for trace-level visibility
LangSmith is LangChain’s hosted tracing platform. It captures every LLM call, every tool invocation, every node transition, with full input/output payloads. For LangGraph agents, the integration is essentially zero-config.
# Just set environment variables
LANGSMITH_API_KEY=lsv2_...
LANGSMITH_PROJECT=agent-system-prod
LANGSMITH_TRACING=true
# Agents automatically emit traces. View them at https://smith.langchain.com/
What LangSmith gives you: a complete tree view of each agent run, with each LLM call’s input, output, latency, and token counts; each tool call’s parameters and results; the full state evolution through the graph. When something goes wrong, you click into the failing run and see exactly where and why.
OpenTelemetry for system-wide metrics
LangSmith handles agent-specific traces. OpenTelemetry handles system-level metrics: requests per second, error rates, latency distributions, infrastructure utilization. Wire both:
from opentelemetry import trace, metrics
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
# Tracing
trace_provider = TracerProvider()
trace_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(trace_provider)
tracer = trace.get_tracer(__name__)
# Metrics
meter_provider = MeterProvider()
metrics.set_meter_provider(meter_provider)
meter = metrics.get_meter(__name__)
agent_latency = meter.create_histogram("agent_invocation_duration_ms")
tool_calls = meter.create_counter("agent_tool_calls_total")
# Use in code
with tracer.start_as_current_span("agent.invoke") as span:
span.set_attribute("user_id", user_id)
span.set_attribute("thread_id", thread_id)
start = time.time()
result = await app.ainvoke(inputs, config)
agent_latency.record((time.time() - start) * 1000)
span.set_attribute("messages_count", len(result["messages"]))
The dashboards that matter
Production agent operators monitor these dashboards daily:
- Successful run rate. Percentage of invocations that complete without error.
- p50/p95/p99 latency. End-to-end agent invocation time.
- Tool call distribution. Which tools are called most frequently? Which fail?
- Token consumption. Daily and per-user token usage; spending forecast.
- Iteration count distribution. Are agents looping more than expected?
- Approval queue size. Number of pending human-in-the-loop reviews.
Alerts that actually fire
Three alert categories cover most production incidents:
- Error-rate spike. Successful run rate drops below threshold (e.g., 95%) over a 10-minute window.
- Latency regression. p95 latency exceeds historical baseline by more than 50%.
- Token cost anomaly. Daily token spend exceeds 1.5x rolling average.
Alert fatigue is real. Tune thresholds carefully and investigate every fired alert; alerts that fire and get ignored stop being useful.
The metric correlation problem
Production agent metrics correlate in non-obvious ways. Three correlations to track:
- Latency vs token usage. Longer agent runs use more tokens. The correlation is approximately linear; outliers indicate either looping agents or unusually large tool results.
- Iteration count vs success rate. Agents that iterate 8+ times on tasks that typically need 3-4 are usually failing in subtle ways. Iteration count is a leading indicator of quality regression.
- Tool call distribution vs error rate. When the relative use of tools shifts, error rates often follow. A model update that changes which tools the agent prefers can mask underlying issues until error rates spike.
Build dashboards that surface these correlations, not just the raw metrics. The signal in the relationships often arrives before the signal in any single metric.
The trace inspection workflow
Production debugging follows a consistent pattern. When a user reports an issue or an alert fires:
- Pull the LangSmith trace for the affected run by thread ID
- Read the supervisor decisions: which specialists were invoked and why
- Drill into the failing specialist’s invocations: which tool calls produced what results
- Check whether the failure was an LLM hallucination, a tool error, or a deterministic logic bug
- Reproduce locally: replay the trace inputs through the local agent
- Fix and add to the eval set so the regression doesn’t return
The full loop typically takes 15-45 minutes for diagnosable issues. Issues that take longer are usually intermittent (a flaky tool, a model temperament shift) and need pattern-matching across many traces rather than deep dives on one.
Cost dashboards
Token cost is the single largest line item in most production agent deployments. A dedicated cost dashboard prevents surprise bills:
- Daily token spend by model. Track Claude Opus, Claude Haiku, GPT-5.x, Gemini, and any other models separately.
- Token spend by tenant. Identify which customers consume disproportionate resources. Useful for pricing decisions and abuse detection.
- Cost per agent invocation. Median, p95, p99. Anomalies often indicate looping agents or unexpected tool-call cascades.
- Forecast. Linear extrapolation from current trend. Aggressive growth shows up here before it shows up in the actual bill.
The evaluation harness
Beyond runtime observability, you need offline evaluation: a curated set of test prompts that the agent runs against on every code change. The pattern:
# evals/run_evals.py
import json
from src.agents.graph import app
EVAL_SET = json.load(open("evals/cases.json"))
def run_eval(case):
config = {"configurable": {"thread_id": f"eval-{case['id']}"}}
result = app.invoke({"messages": case["input"]}, config=config)
return {
"case_id": case["id"],
"expected": case["expected_outcome"],
"actual": result["messages"][-1].content,
"tools_called": [m.name for m in result["messages"] if hasattr(m, "name")],
"iteration_count": result.get("iteration_count", 0),
}
results = [run_eval(c) for c in EVAL_SET]
# Compare with previous run, fail CI if regression on key metrics
The eval set evolves with the system. Add new cases when you fix bugs (“don’t break this again”), when you ship new features (“validate this works”), and when you discover production failures (“regress against this from now on”). A 200-case eval set catches most regressions before they reach production.
Chapter 9: Authentication, Rate Limiting, and Tenant Isolation
Multi-tenant production agents need authentication, rate limiting, and tenant isolation built in from the start. Retrofitting these is painful; designing them in is straightforward.
Authentication at the API gateway
The agent system sits behind an API gateway (FastAPI, framework of your choice) that handles auth before the request reaches the graph:
from fastapi import FastAPI, Depends, HTTPException, Header
import jwt
import os
api = FastAPI()
JWT_SECRET = os.environ["JWT_SECRET"]
async def get_current_user(authorization: str = Header(...)) -> dict:
if not authorization.startswith("Bearer "):
raise HTTPException(401, "Missing bearer token")
token = authorization.removeprefix("Bearer ").strip()
try:
payload = jwt.decode(token, JWT_SECRET, algorithms=["HS256"])
except jwt.InvalidTokenError as e:
raise HTTPException(401, f"Invalid token: {e}")
return payload # {"user_id": "...", "tenant_id": "...", "role": "..."}
@api.post("/agents/start")
async def start(req: StartRequest, user: dict = Depends(get_current_user)):
config = {"configurable": {
"thread_id": req.thread_id,
"user_id": user["user_id"],
"tenant_id": user["tenant_id"],
}}
inputs = {**req.dict(), "user_id": user["user_id"]}
return await app.ainvoke(inputs, config=config)
Rate limiting per tenant
Without rate limiting, one tenant’s agent fleet can exhaust your model API quota and starve other tenants. Implement per-tenant rate limits at the gateway:
import redis.asyncio as aioredis
from fastapi import Request
redis = aioredis.from_url(os.environ["REDIS_URL"])
async def rate_limit(tenant_id: str, max_per_min: int = 60):
key = f"ratelimit:{tenant_id}:{int(time.time() // 60)}"
count = await redis.incr(key)
if count == 1:
await redis.expire(key, 60)
if count > max_per_min:
raise HTTPException(429, "Rate limit exceeded")
@api.post("/agents/start")
async def start(req, user=Depends(get_current_user)):
await rate_limit(user["tenant_id"], max_per_min=60)
# ... continue with invocation
Tenant isolation in MCP tool catalogs
Different tenants may have access to different tools. The MCP layer is the right place to enforce this:
# mcp-servers/code/server.py
@mcp.tool()
async def deploy_to_production(payload: dict, tenant_id: str = None) -> str:
"""Deploy code to production environment."""
if not tenant_id:
return "ERROR: tenant_id required"
if tenant_id not in PRODUCTION_DEPLOY_ALLOWLIST:
return "ERROR: tenant not authorized for production deployment"
# ... actual deploy logic
# In the agent's MCP client setup, pass tenant_id as a parameter
# that the tool requires. The model cannot bypass it because the MCP
# server enforces the check.
Per-tenant cost caps
Beyond rate limits on requests, set per-tenant token cost caps. A tenant whose agent runs amok can blow through significant token budget before rate limits notice:
# cost_guard.py
async def check_tenant_budget(tenant_id: str, daily_cap_usd: float) -> bool:
today = datetime.now().strftime("%Y-%m-%d")
key = f"tenant_spend:{tenant_id}:{today}"
spent = float(await redis.get(key) or 0)
if spent > daily_cap_usd:
return False
return True
async def record_tenant_spend(tenant_id: str, tokens_in: int, tokens_out: int,
model: str):
cost = compute_cost(tokens_in, tokens_out, model)
today = datetime.now().strftime("%Y-%m-%d")
key = f"tenant_spend:{tenant_id}:{today}"
await redis.incrbyfloat(key, cost)
await redis.expire(key, 86400 * 2) # 2 days
# Wire into the API gateway
@api.post("/agents/start")
async def start(req, user=Depends(get_current_user)):
if not await check_tenant_budget(user["tenant_id"], daily_cap_usd=50.0):
raise HTTPException(429, "Daily budget exceeded")
# ... continue
Data isolation
Beyond authentication, tenants need data isolation. The agent must not be able to see other tenants’ data through tool calls. Patterns that enforce this:
- Tenant ID propagation. Every tool call carries the tenant_id. Tools enforce that they only operate on data belonging to that tenant.
- Tenant-scoped MCP servers. For sensitive applications, run separate MCP server instances per tenant. The agent connects to the tenant’s instance based on the user’s identity.
- Database row-level security. Postgres RLS or equivalent enforces tenant isolation at the storage layer. Even if a tool bug leaks across tenants, the database refuses the access.
Audit logging
Every action an agent takes on behalf of a tenant should be auditable. Log the trinity: who (user_id, tenant_id), what (the action and parameters), when (timestamp), with the agent’s reasoning preserved as well.
import structlog
logger = structlog.get_logger()
def audit(user_id: str, tenant_id: str, action: str, **kwargs):
logger.info("agent_action",
user_id=user_id,
tenant_id=tenant_id,
action=action,
timestamp=time.time(),
**kwargs)
# Wrap each tool call with audit logging via a callback handler
class AuditCallback(BaseCallbackHandler):
def on_tool_start(self, serialized, input_str, **kwargs):
audit(self.user_id, self.tenant_id, "tool_call",
tool=serialized["name"], input=input_str)
Chapter 10: Production Deployment Patterns
The development environment is one machine. Production is several. This chapter covers the deployment patterns that scale agents from prototype to production.
The component topology
A production deployment of LangGraph + MCP has these components:
| Component | Scale | Deployment notes |
|---|---|---|
| API gateway | Multiple replicas behind load balancer | Stateless; FastAPI + uvicorn |
| Agent worker | Multiple replicas; same code as gateway | Stateless once checkpointer is shared |
| Postgres (checkpoints) | Single primary + read replicas as needed | State volume scales linearly with active threads |
| Redis (rate limit, cache) | Cluster or sentinel for HA | Most state is short-lived |
| MCP servers | Multiple replicas per server type | Stateless; scale independently |
| Observability stack | Hosted (LangSmith, Datadog, Honeycomb) | Self-host only if data residency requires |
Containerization
Both the agent service and each MCP server containerize cleanly. A representative Dockerfile:
# Dockerfile (agent service)
FROM python:3.12-slim
WORKDIR /app
RUN pip install --no-cache-dir uv
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev
COPY src/ ./src/
ENV PATH="/app/.venv/bin:$PATH"
EXPOSE 8000
CMD ["uvicorn", "src.api.server:api", "--host", "0.0.0.0", "--port", "8000"]
Kubernetes deployment
For most teams running production agents, Kubernetes is the natural deployment target. The minimum manifests:
# k8s/agent-service.yaml
apiVersion: apps/v1
kind: Deployment
metadata: {name: agent-service, namespace: agents}
spec:
replicas: 6
selector: {matchLabels: {app: agent-service}}
template:
metadata: {labels: {app: agent-service}}
spec:
containers:
- name: agent
image: registry.example.com/agent-service:0.42.0
envFrom: [{secretRef: {name: agent-secrets}}]
resources:
requests: {cpu: 500m, memory: 1Gi}
limits: {cpu: 2, memory: 4Gi}
livenessProbe:
httpGet: {path: /health, port: 8000}
periodSeconds: 30
readinessProbe:
httpGet: {path: /ready, port: 8000}
periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata: {name: agent-service, namespace: agents}
spec:
selector: {app: agent-service}
ports: [{port: 80, targetPort: 8000}]
Scaling autoscaler signals
Standard CPU-based autoscaling does not work for agent workloads. CPU does not move when the bottleneck is the LLM API. Scale on a custom metric: pending requests per pod, or invocation rate.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: {name: agent-hpa, namespace: agents}
spec:
scaleTargetRef: {apiVersion: apps/v1, kind: Deployment, name: agent-service}
minReplicas: 4
maxReplicas: 64
metrics:
- type: Pods
pods:
metric: {name: pending_agent_invocations}
target: {type: AverageValue, averageValue: 8}
- type: Pods
pods:
metric: {name: active_thread_count}
target: {type: AverageValue, averageValue: 200}
The CI/CD pipeline for agent code
Agent code deployments need a more careful pipeline than typical service code. The pipeline that has emerged as the production default:
- Pull request. Engineer pushes code with required tests.
- Static checks. Type checking (pyright/mypy), linting (ruff), formatting validation.
- Unit tests. Mock-LLM tests for individual nodes.
- Integration tests against real LLMs. The eval suite runs with the actual model APIs. Expensive but high-signal.
- Prompt regression check. If any prompt changed, run the prompt regression suite. Failures block the merge.
- Code review. Especially for prompt changes — at least one reviewer with prompt-engineering instincts.
- Canary deployment. Merge triggers canary deploy to 5% of production traffic.
- Automated rollback gate. Error rate, latency, and cost metrics monitored. Auto-rollback if any breaches threshold.
- Promote. If canary holds for 2 hours, promote to 100%.
The eval suite running on every PR is the highest-leverage piece. It is also the most controversial because it costs real money — typical eval runs spend $5-20 per pipeline. Most teams accept this; the cost of a bad agent reaching production is much higher.
The hosted LangGraph option
Beyond self-hosting, LangChain offers hosted LangGraph through LangGraph Cloud. The trade-offs versus self-hosting:
| Concern | Self-hosted | LangGraph Cloud |
|---|---|---|
| Time to production | 4-8 engineer-weeks | Days |
| Infrastructure cost | Pay for compute, DB, observability separately | Bundled |
| Operational responsibility | Your team owns it 24/7 | LangChain ops team handles infra |
| Customization | Full control | Constrained by platform conventions |
| Data residency | Wherever you deploy | Limited to LangChain’s regions |
For early-stage products and small teams, hosted is often the right choice — faster to ship, less operational burden, costs predictable. For larger organizations with strict data residency or compliance constraints, self-hosting wins on flexibility. Most teams start hosted, migrate to self-hosted when scale or compliance demands it.
Database schema for checkpoints
Postgres checkpointing produces a specific schema. Knowing it helps with operational tasks (backup, archival, debugging):
| Table | Purpose | Operational notes |
|---|---|---|
| checkpoints | One row per checkpoint, indexed by thread_id and timestamp | Grows with active threads × checkpoints per run |
| checkpoint_writes | Pending writes between checkpoints | Cleared after checkpoint commits |
| checkpoint_blobs | Large binary state values | Use TOAST for large entries |
Operational considerations: archive completed-thread checkpoints after 30-90 days to control table size. Index thread_id and created_at on the checkpoints table; queries by thread are common. Use partitioning by month if your scale exceeds 10M checkpoints per month.
The MCP server fleet
Production MCP servers form a fleet. Three operational considerations:
- Independent scaling. Each server type scales based on its own load. The web-search server may need 20 replicas at peak; the database-tool server may need 4. Don’t co-locate; let HPAs scale each independently.
- Versioned deployments. Each MCP server is a service. Deploy with the same blue-green or canary discipline as the agent service. Tool contract changes are user-visible behavior changes.
- Health checks specific to MCP. The MCP protocol has a built-in
initializehandshake. Your readiness probe should exercise it, not just check that the HTTP server is up.
Blue-green and canary deployments
Agent code changes can break in subtle ways: a new prompt produces unexpected outputs, a graph refactor breaks state migration, an MCP server upgrade changes a tool signature. Deploy carefully:
- Canary first. 1-5% of traffic to the new version, automated rollback if error rate spikes.
- Compare metrics. Old and new versions on shared dashboards; watch for divergence.
- Hold canary at least 4 hours. Many edge cases only surface across a full traffic distribution.
- Promote or rollback decisively. Half-rolled-out deployments are operational nightmares.
Chapter 11: Cost, Latency, and Reliability Optimization
An agent that works is the start. An agent that works at acceptable cost, latency, and reliability is the goal. Three optimization dimensions matter.
Token cost optimization
The dominant cost in most agent deployments is LLM tokens. The optimization techniques that consistently produce material savings:
- Cheaper models for cheaper roles. Use Claude Haiku or GPT-4o-mini for the supervisor and trivial tool calls; reserve Claude Opus or GPT-5.5 for reasoning-heavy specialist work.
- Prompt caching. Both Anthropic and OpenAI offer prompt caching with 50-90% discounts on cached tokens. Structure prompts so static portions (system prompts, tool definitions) come first; cache them.
- Context pruning. Long conversations accumulate state. Periodically summarize older messages and replace them with the summary.
- Tool result truncation. Web search results, file contents, and SQL outputs are often long. Truncate aggressively in the tool layer; the model rarely needs more than the first few KB.
- Iteration limits. Cap the agent’s loop count. A runaway agent hitting 50 iterations costs 10x what a normal 5-iteration run costs.
# Iteration cap pattern
from langgraph.graph import StateGraph
class State(TypedDict):
messages: list
iteration_count: int
def agent_node(state):
if state["iteration_count"] >= 20:
return {"messages": [AIMessage(content="Iteration limit reached.")]}
response = llm.invoke(state["messages"])
return {"messages": [response], "iteration_count": state["iteration_count"] + 1}
Latency optimization
Agent latency is dominated by the LLM call. Mitigations:
- Streaming. Surface tokens to the user as they generate. Perceived latency drops dramatically.
- Parallelism for parallelizable subtasks. Multi-agent workflows can dispatch independent specialists concurrently.
- Speculative tool execution. When the LLM is likely to call a particular tool, start the tool call before the LLM finishes deciding. Cancel if wrong.
- Provider colocation. Run the agent in the same region as the LLM provider’s nearest endpoint. Cross-region latency adds up across multi-call workflows.
The cost optimization decision tree
When agent costs need to come down, the levers in priority order:
- Iteration cap. Hard cap looping agents. Highest-impact, lowest-risk single change.
- Prompt caching. Reorganize prompts so static portions cache. 30-50% cost reduction at the model API layer.
- Model right-sizing. Move supervisor and lightweight roles to smaller, cheaper models. 20-40% additional reduction.
- Context pruning. Summarize older turns. 10-20% reduction on long conversations.
- Tool result truncation. Cap the size of strings returned from tools. 5-15% reduction.
- Caching at the application layer. Last because the operational complexity is highest. 10-30% reduction on workloads that benefit.
Apply in order. Most teams hit their cost target after the first three changes; the rest are diminishing returns.
Caching strategies
Beyond prompt caching at the model layer, application-level caching reduces tool-call costs:
- Semantic caching. If two queries are semantically similar, return the cached answer. Useful for FAQ-style workloads. Implementations: Redis with embedding-based key matching, GPTCache, custom solutions.
- Tool result caching. Web searches, file fetches, and API calls often return the same result for the same input. Cache with TTL appropriate to the data freshness needs.
- Negative caching. Cache “this query returned no useful results” so repeated identical failures don’t repeatedly hit the backend.
Be careful with caching in agent contexts. Stale cache returns can confuse the LLM into thinking it has fresh data. Set TTLs aggressively for time-sensitive content (news, prices, schedules) and conservatively for stable content (documentation, definitions).
Reliability patterns
Production agents fail. The patterns that absorb failures gracefully:
- Retries with exponential backoff. Tool calls and LLM calls retry on transient errors. Idempotency-key tools to prevent double-execution on retry.
- Circuit breakers. If a tool fails 5 times in 30 seconds, skip it for the next 60 seconds rather than retrying. Avoid cascading failures.
- Timeouts at every layer. Tool calls timeout after 15 seconds; LLM calls timeout after 60 seconds; full agent runs timeout after 5 minutes. Make these configurable per tenant.
- Graceful degradation. When the supervisor LLM is unavailable, fall back to a simple deterministic router. When the code agent’s MCP server is down, the supervisor knows to skip it and respond with what it has.
The latency budget
Allocating latency budget across the agent’s components prevents one slow component from dominating. A working budget for a typical multi-agent invocation:
| Component | Latency budget | Optimization lever |
|---|---|---|
| API gateway auth + routing | 50ms | Caching, lightweight middleware |
| Supervisor LLM call | 1.5s | Use a fast small model (Haiku, Mini) |
| Specialist LLM call (each) | 3-5s | Streaming, prompt caching |
| MCP tool call (each) | 500ms | Connection pool, async parallelism |
| Checkpointer write | 50ms | Async write, batched commits |
| Total wall clock (5 turns) | ~12-15s | End-to-end optimization |
Streaming changes the user-perceived latency dramatically: even a 15-second total run feels responsive when tokens stream as soon as they generate. Without streaming, the same run feels broken.
The cost of reliability
Reliability has a cost. Retries multiply token spend. Timeouts force re-runs. Circuit breakers reduce capability. Track these in your observability and tune thresholds based on actual production patterns. The right balance depends on your use case: a customer-facing chat needs higher reliability than a background batch job.
Chapter 12: Pitfalls, Case Studies, and What’s Next
Six pitfalls account for most production agent failures in 2026. Knowing them in advance saves significant operational time.
The full FastAPI gateway pattern
For reference, a complete production-pattern API gateway pulling together authentication, rate limiting, cost guarding, agent invocation, and observability:
from contextlib import asynccontextmanager
from fastapi import FastAPI, Depends, HTTPException, Header, Request
from langgraph.checkpoint.postgres import PostgresSaver
from langchain_mcp_adapters.client import MultiServerMCPClient
import os, time, structlog
logger = structlog.get_logger()
@asynccontextmanager
async def lifespan(app: FastAPI):
# Startup: open MCP client, set up checkpointer, compile graph
app.state.mcp = MultiServerMCPClient({...})
await app.state.mcp.__aenter__()
app.state.checkpointer = PostgresSaver.from_conn_string(
os.environ["POSTGRES_URL"]).__enter__()
app.state.checkpointer.setup()
app.state.graph = build_graph(app.state.mcp.get_tools()).compile(
checkpointer=app.state.checkpointer)
logger.info("agent_service_ready")
yield
# Shutdown: close MCP client, close DB
await app.state.mcp.__aexit__(None, None, None)
app.state.checkpointer.__exit__(None, None, None)
api = FastAPI(lifespan=lifespan)
@api.post("/agents/start")
async def start(req: StartRequest, request: Request,
user: dict = Depends(get_current_user)):
if not await check_tenant_budget(user["tenant_id"], 50.0):
raise HTTPException(429, "Daily budget exceeded")
await rate_limit(user["tenant_id"], 60)
config = {"configurable": {
"thread_id": req.thread_id,
"user_id": user["user_id"],
"tenant_id": user["tenant_id"],
}}
inputs = {**req.dict(), "user_id": user["user_id"]}
logger.info("agent_invocation_start",
thread_id=req.thread_id,
tenant_id=user["tenant_id"])
start_time = time.time()
try:
result = await request.app.state.graph.ainvoke(inputs, config=config)
except Exception as e:
logger.exception("agent_invocation_failed", error=str(e))
raise HTTPException(500, "Agent invocation failed")
duration_ms = (time.time() - start_time) * 1000
logger.info("agent_invocation_complete",
thread_id=req.thread_id,
duration_ms=duration_ms,
messages=len(result["messages"]))
return {"status": "complete", "result": result}
Pitfall 1: Tool sprawl
Teams add tools opportunistically until the prompt is bloated with 40+ tool definitions. The model’s tool-selection accuracy degrades sharply past 15-20 tools. Discipline tool catalogs by specialist; do not give every agent every tool.
Pitfall 2: Stateless thinking in stateful systems
Engineers new to LangGraph write nodes that treat each invocation as fresh. Then they discover the state machine actually accumulates state across runs. The result: bugs that only show up after the second invocation. Treat state as an explicit accumulator from day one.
Pitfall 3: Mixing agent code with business logic
Tempting to put validation, formatting, and business rules inside agent nodes. Resist. Business logic belongs in tools (MCP servers) or in deterministic post-processing. Keeping nodes thin makes the system testable.
Pitfall 4: Inadequate observability
Teams ship agents with logging-only observability. The first production incident takes 6 hours to diagnose because there are no traces, no metrics, no inspection points. Wire LangSmith and OpenTelemetry from day one; observability is not a phase-2 task.
Pitfall 5: Over-trusting LLM tool selection
The supervisor or specialist LLM occasionally makes wrong tool calls. Without verification, those wrong calls execute. Add deterministic guards: validate parameters before tool execution, require approval for high-blast-radius actions, log and alert on unusual call patterns.
Pitfall 6.5: Underestimating prompt-injection risk
Agents that consume external content are vulnerable to prompt injection: an attacker plants instructions in web pages, emails, or files that the agent fetches, and those instructions trick the agent into doing something unintended. Production mitigations:
- Treat external content as data, not instructions. System prompts explicitly state that text fetched from external sources is data to be analyzed, not instructions to be followed.
- Limit blast radius. Even if a tool call is hijacked, the worst it can do is bounded by the tool’s own permissions. Don’t give the agent tools that can do irreversible damage on a single call.
- Output filters. Run agent outputs through a deterministic validator before they affect the world. Catches obvious anomalies.
- Human-in-the-loop for high-stakes actions. The simplest defense: a human reviews any irreversible action before commit.
The migration path between LangGraph versions
LangGraph evolves. Major version updates can include breaking changes. The migration discipline that keeps things smooth:
- Pin versions in production. Never run on a moving target. Update deliberately.
- Read changelogs. The release notes describe breaking changes with migration paths. Take them seriously.
- Migrate in staging first. A full deployment cycle in staging surfaces issues before production traffic.
- Update incrementally. Don’t skip multiple major versions. Migrating 0.2 → 0.4 directly is harder than 0.2 → 0.3 → 0.4.
- Eval-suite-as-migration-test. Run the full eval suite before and after each upgrade. Differences indicate migration issues.
Pitfall 6: Underestimating prompt evolution
System prompts evolve. Each evolution can change agent behavior in unexpected ways. Treat prompts like code: version them, test them, deploy them with the same care as code changes. A prompt regression suite that runs on every change catches most surprises.
Case study: a customer-support agent at scale
A SaaS company built a customer-support agent using LangGraph + MCP. The agent handles 80% of incoming tickets without human escalation. Architecture:
- Supervisor routes between three specialists: knowledge-base lookup, account-action handler, and escalation
- MCP servers for KB search, account API, and ticketing system integration
- Postgres checkpointer holds conversation state across multi-turn interactions
- Human-in-the-loop interrupt before any account-modifying action
- LangSmith for trace visibility, OpenTelemetry for system metrics
Operational results after 6 months:
| Metric | Pre-agent | Post-agent |
|---|---|---|
| Average resolution time | 4.2 hours | 23 minutes (auto-resolved) |
| Escalation rate to human | 100% | 20% |
| Cost per ticket | $8.40 | $1.10 |
| CSAT score | 3.9 / 5 | 4.4 / 5 |
Case study: a developer-tool company’s coding agent
A developer-tool company built a coding agent that helps developers debug build errors. The agent ingests a build log, optionally fetches relevant repo files, and proposes fixes. Architecture:
- Single-agent ReAct loop (no multi-agent supervisor — the workflow is narrow enough)
- MCP server exposing repo file access, build log retrieval, and a sandboxed code-execution tool
- Anthropic Claude Opus 4.7 as the reasoning model
- Strict iteration cap (12 turns) and 90-second wall clock limit
- LangSmith for trace inspection during development; OpenTelemetry for production metrics
Operational results after first quarter in production:
| Metric | Value |
|---|---|
| Build errors auto-resolved | 43% (target was 30%) |
| Median resolution time | 22 seconds (vs 14 minutes manual) |
| Cost per invocation | $0.12 (Claude Opus + tooling) |
| Developer satisfaction | 4.6 / 5 from team surveys |
The lesson the team highlighted: starting with a single-agent ReAct pattern and proving it worked before adding supervisor complexity saved months of architectural overengineering. They added a multi-agent layer six months later when the workflow expanded to cover both build errors and runtime errors with different specialist tooling.
Case study: an internal-research agent
A consulting firm built an internal-research agent for analysts. Multi-agent supervisor with research, data, and writing specialists. The system processes 200-300 research requests per day, replacing what was previously manual web research and document drafting.
Key implementation choices: Anthropic Claude Opus for the writing specialist, Haiku for supervisor, and a SQL-aware data specialist with read-only access to the firm’s internal databases via MCP. Human approval required before any external sharing of research output. Result: typical research task time dropped from 4-6 hours to 35-50 minutes of analyst review time on AI-drafted output.
What’s next: A2A and the inter-agent protocol
The 2026-2027 horizon brings agent-to-agent protocols. Google’s A2A specification, Anthropic-led MCP extensions for inter-agent communication, and other proposals are converging on standards for agents from different organizations to call each other. Production implications:
- Composable agent ecosystems. Your customer-support agent could call a third-party payments agent without bespoke integration.
- Authentication and trust frameworks. Cross-organization agent calls need identity, authorization, and reputation infrastructure.
- Liability questions. When my agent calls your agent and something goes wrong, whose responsibility?
Watch this space. The standards are not yet stable enough to bet production architecture on, but they will be by late 2027.
What’s next: dynamic graph generation
Current LangGraph systems use graphs the engineer designs. The 2027 frontier is graphs that an LLM generates dynamically based on the task at hand. Early experiments show LLMs that compose specialist agents into ad-hoc graphs performing well on novel tasks where the engineer didn’t anticipate the topology.
Production caveats are real: dynamically generated graphs are harder to reason about, harder to test, and harder to debug. Early adopters use dynamic generation for exploration and freeze the resulting graph once the pattern stabilizes. The full “agent that writes its own graph for each task” pattern is research-grade in 2026 and will become production-grade as observability and debugging tools mature.
What’s next: longer-running agents
Today’s production agents complete within minutes. The next frontier is agents that run for hours, days, or longer — autonomous workers that take on substantial bounded projects. Examples already shipping in 2026:
- Software engineering agents that take a Jira ticket and submit a pull request 4-6 hours later
- Research agents that produce a comprehensive report on a topic over the course of a workday
- Data analysis agents that explore a dataset, develop hypotheses, and write up findings
Long-running agents need architectural changes: robust checkpointing across hours of state, resumability across infrastructure events, and human-in-the-loop check-ins at sensible cadence rather than at every step. LangGraph + MCP is well-suited to this — checkpointing is first-class — but the operational discipline scales.
Case study: a finance back-office automation agent
A mid-market accounting firm built an agent that processes vendor invoices: extracts invoice data, validates against purchase orders, flags discrepancies, and queues approved invoices for payment. Architecture choices:
- Multi-agent supervisor with extraction, validation, and notification specialists
- MCP servers connecting to the firm’s ERP, document storage, and approval workflow systems
- Mandatory human-in-the-loop interrupt before any payment authorization
- Postgres checkpointer with 90-day retention for audit purposes
- Comprehensive audit logging of every agent decision
Operational results in the first six months:
| Metric | Before agent | After agent |
|---|---|---|
| Average invoice processing time | 3.4 days | 11 hours |
| Discrepancy catch rate | 78% (missed by humans) | 97% (agent flags) |
| AP team headcount required | 6 FTE | 3 FTE + agent oversight |
| Cost per invoice processed | $8.20 | $1.40 |
The two AP team members repurposed to higher-value work (vendor relationship management, financial planning) reported higher job satisfaction. The agent did not eliminate jobs; it changed what those jobs looked like.
Pitfall 7: Insufficient context window planning
Long agent runs accumulate messages, tool results, and intermediate reasoning. Eventually the context window fills and the LLM loses access to earlier turns. The patterns that prevent this:
- Summarization checkpoints. Every N turns, replace older messages with a structured summary. Preserves the key information without the token bloat.
- Selective context. Specialists may not need the full conversation history. Pass them only the relevant slice.
- Tool result trimming. Older tool results often don’t matter for current reasoning. Drop them after a few turns.
- Context-window monitoring. Alert when an agent’s context exceeds 80% of the model’s window. Approaching the limit predicts behavior degradation.
Pitfall 8: Treating prompts as static
Prompts written once and never revisited rot. Models change, user behavior changes, edge cases emerge. The prompt that worked at launch may produce subtly worse results six months later. Build a prompt review cadence: quarterly review of system prompts, comparison against current state-of-art, deliberate evolution rather than reactive change.
The interaction with traditional software services
Production agents do not exist alone. They interact with traditional software services — databases, queues, web APIs, monitoring stacks. The patterns that keep these interactions clean:
- Agents call services through MCP tools, not directly. The MCP layer keeps agent code clean and changes in service interfaces don’t ripple into agent code.
- Services don’t call agents synchronously. Agent latency is variable. Synchronous calls cause user-facing services to inherit that variance. Use async patterns: queue-based handoffs, webhooks for completion.
- Idempotency keys for agent-initiated actions. When the agent retries (which it will), idempotency keys prevent duplicate side effects.
- Circuit breakers between agent and service. If the service is down, fail fast and let the agent recover. Don’t let downstream service failure cascade into agent timeouts.
Multi-region and edge deployments
2026 deployments are mostly single-region. By late 2027, expect multi-region active-active agent deployments to be standard for global products. Considerations:
- Checkpoint replication across regions (Postgres logical replication, distributed Redis, custom sync)
- Region-aware MCP discovery (route to nearest server)
- Data residency requirements that pin certain tenants to specific regions
- Failover patterns for region outages
The agent governance question
Beyond the operational concerns, organizations running multiple agents need governance frameworks. Questions that come up:
- Who approves a new agent for production deployment?
- What review process applies to a new MCP tool?
- Who is accountable when an agent makes a costly mistake?
- What policies govern data the agent sees and the actions it takes?
- How are model changes (a new Claude version, a new GPT version) evaluated and adopted?
The governance pattern that works: a small standing committee (representation from engineering, product, security, compliance) reviews new deployments quarterly. Documented standards apply to new agents. Specific deployments above a risk threshold (financial, customer-facing, regulated) get individual review. Below that threshold, the standards apply automatically. This balances speed with responsibility.
Building the agent platform team
Organizations running multiple production agents benefit from a dedicated agent platform team. The team’s responsibilities:
- Shared infrastructure. Common LangGraph runtime, MCP discovery, observability stack, security review process.
- Reusable specialist agents. Common research agents, code agents, communication agents that product teams can compose into their workflows.
- MCP server marketplace. Internal catalog of approved MCP servers that product teams can pull from.
- Pattern library. Reference implementations of common patterns (single-agent ReAct, supervisor, human-in-the-loop, long-running) that product teams adapt.
- Standards and review. Code review for new agent deployments, security review for new tools, capacity planning for shared infrastructure.
Team size scales with the number of production agents. A team of 4-6 engineers can support 8-15 production agent applications across an organization. Past 15 applications, expect to grow the platform team or push more responsibility to product teams with platform-team mentorship.
Documentation discipline
Agent systems are particularly punishing for poor documentation. The state schema, the supervisor’s routing logic, the MCP servers’ tool catalogs, the prompt templates — all need readable, current documentation. Patterns that work:
- Inline docstrings. Every node, every tool, every state field has a docstring explaining its purpose.
- Architecture decision records. Major architectural choices captured in version-controlled ADRs. Why the supervisor pattern, why these specialists, why this checkpoint backend.
- Runbooks. For each known operational scenario, a runbook describing diagnosis and remediation. The on-call engineer’s lifeline.
- System diagram. Updated quarterly. Shows the agent service, MCP servers, checkpoint storage, observability stack, and how they relate.
Documentation that is allowed to rot creates incidents disproportionate to the documentation cost. Make documentation a first-class output of every agent project, reviewed in code review, kept current as the system evolves. Treat documentation gaps as defects — they will absolutely surface as on-call burden when nobody on the team remembers why a particular routing rule exists, or what was supposed to happen when an MCP server returned a specific error code.
The investment in documentation pays back continuously. New team members ramp up faster. On-call engineers diagnose issues without paging the original author. Architecture reviews surface trade-offs without re-discovering them. The bar to be clearly above: someone joining the team in month nine should be able to read the docs and understand the system within their first week. If your documentation passes that test, the rest of the operational discipline tends to follow naturally because you are forced to think clearly about what the system actually does. The documentation discipline is the operational discipline made legible to other humans.
Build that legibility deliberately and the agent system stays maintainable as the team and the system both grow — and as the model layer continues to evolve underneath.
The platform team’s flywheel
A working platform team produces a flywheel: each successful agent deployment makes the next one cheaper to ship. The mechanics:
- Reusable patterns accumulate. The first deployment establishes the supervisor pattern; the fifth deployment reuses it without thinking.
- Shared MCP servers grow the toolbox. Each new MCP server becomes available to subsequent agents. Year-two agents have 3-4x the tool selection of year-one agents.
- Operational knowledge compounds. Incidents on the first agent teach lessons that prevent the same incidents on the tenth agent.
- Recruitment becomes self-reinforcing. Engineers want to work where production agents are shipping. The team becomes a magnet for the talent that grows the team further.
Building this flywheel is the platform team’s most valuable contribution. The first three agents are the hardest; agents four through ten ship dramatically faster on the same team.
Hiring for agent engineering
The skills profile for agent engineers differs from traditional backend engineers. The competencies that matter:
- Strong async Python (or TypeScript) — agents are deeply async
- Distributed systems familiarity — checkpointing, retries, idempotency
- API design instincts — MCP tool design is API design
- Systems thinking on prompts — treating prompts as code, with versioning and testing
- Operational mindset — observability, debugging, incident response on AI systems
The hardest hires in 2026 are engineers who combine all five. Expect to grow most of these skills internally rather than hire fully-formed. Pair junior engineers with senior mentors and give them small agent projects with high learning value before assigning critical production deployments.
Open-source vs proprietary models in agent stacks
The choice between proprietary frontier models (Claude, GPT, Gemini) and open-weights models (Llama, Mistral, Qwen, DeepSeek) for agent stacks comes up regularly. Three considerations:
- Capability gap on tool use. Frontier proprietary models still lead on tool selection accuracy and reasoning quality, especially for multi-tool, multi-step workflows. Open-weights models are catching up but lag by 12-18 months on this specific dimension.
- Cost-quality trade-off. Open-weights models running on your own infrastructure can be cheaper at high volume — break-even is typically around 10-50 million tokens/day depending on infrastructure choices.
- Privacy and data residency. If your data cannot leave your environment, open-weights running on your hardware is the only path. Frontier proprietary models with HIPAA-compliant API endpoints solve some but not all of this.
The hybrid pattern that works in 2026: use proprietary frontier models for the reasoning-heavy roles (specialists), use open-weights or smaller proprietary models for the lightweight roles (supervisor, classification). The cost savings are meaningful and the capability hit on the lightweight roles is small.
The closing posture
LangGraph + MCP is the production agent stack of 2026 because it solves the right problems with the right abstractions. State is explicit. Tools are decoupled. Human-in-the-loop is first-class. Observability is straightforward. Scaling to multi-tenant production is well-understood.
Teams that build agents on this stack ship faster, debug more easily, and operate more reliably than teams that pick alternative architectures. The chapters above are the working playbook. Build the foundation right, evolve the capability layer as your product needs grow, and operate the system with the discipline that any production system requires. The agentic era is here, and the engineering practices that worked for traditional software services apply — with the additional disciplines that AI introduces.