5 Things I Learned Running LangGraph Multi-Agent in Production for a Year
I spent the past year building and operating a LangGraph multi-agent system in production — an AI assistant for a B2B device-management platform. It went through a full single-agent → multi-agent → rewrite arc, and a lot of what I “learned” was really just discovering which design decisions hold up past week 40.
This isn’t a war-stories post. It’s the five design patterns I’d reach for on day one of the next agent app, with the reasoning that earned them. Five axes: orchestration, state & history, control flow, scheduled work, and RAG.
1. Orchestration: start with one agent and skills, not a fleet of sub-agents
The textbook pattern is a supervisor routing to specialized child agents. We built exactly that — agents for diagnostics, config, logs, inspection — and it scaled badly: every supervisor→child→supervisor hop re-serializes context into the next LLM call, child agents sharing a state dict produced duplicated history and ordering bugs, and “who owns this request” became a routing problem.
We collapsed it into a single deep agent that calls tools directly and loads skills — folders of markdown (instructions + example commands) discovered at runtime:
agent = create_deep_agent(
model=get_llm(),
tools=[save_memory, schedule_task, ...], # a few cross-cutting tools
subagents=[], # ← no child agents
skills=["/skills"], # ← capabilities = markdown files
middleware=[GreetingFastPath(), InjectContext(), TokenGuard()],
)
Two things made this a durable win, not just a refactor:
- Capabilities became data, not code. Adding a skill is writing a
SKILL.mdfile, not shipping a new agent class. The unit of extension matches the unit of change. - Skill granularity follows the user’s mental model. One skill = one complete task the model can finish in a single reasoning pass (e.g. “diagnose an offline device”, with the full command sequence inside). We learned the hard way that splitting one workflow into many micro-skills just multiplies tool-call round-trips.
And note the middleware stack — GreetingFastPath (skip the LLM entirely for “hi”), InjectContext (timezone, tenant, memories), TokenGuard (more on that below). Middleware is for cross-cutting concerns; business logic stays in skills and prompts. A middleware that starts encoding “if the device is offline, run diagnostics” is a middleware that will fight you in three weeks.
Reach for a sub-agent only when you have a real need for parallelism or hard context isolation. Until then, one agent + tools + skills is flatter, cheaper, and has one state to reason about. Multi-agent is a tool, not a goal.
2. State & history: separate execution state from durable history
This is the single most important data decision in the whole system, and the easiest to get wrong. Early on we let LangGraph’s checkpointer and our own app-level store both believe they owned the conversation. The result was history that doubled on every retry and a checkpoint blob that eventually blew past MongoDB’s 16MB document limit.
The fix is a clean ownership split:
- Durable history — what the user sees, what you audit — lives in the application layer, modeled like Langfuse: a
Session(metadata + ordered list of trace IDs, no message bodies) andTraces (one request each: input, output, intermediate tool calls). Permanent. Queryable. Filtered by user/tenant on every read. - Execution state — the checkpointer — holds only the minimum needed to resume an in-flight run. It’s allowed to forget. It defends itself with a trim:
def trim_to_fit(messages, blob_size):
if blob_size <= LIMIT:
return messages
keep = max(MIN_KEEP, int(len(messages) * LIMIT / blob_size * 0.8))
while serialized(messages[-keep:]) > LIMIT and keep > MIN_KEEP:
keep //= 2
return sanitize_boundaries(messages[-keep:]) # drop orphaned tool messages
The pieces that make this work in practice:
- Rebuild context per request from durable history, not from whatever the checkpointer happens to still hold. The checkpointer trimming is then invisible to the user.
- Design state for deletion from the start. Trimming will leave orphaned
ToolMessages with no matchingAIMessage— sanitize the boundaries or the next LLM call errors. If you keep a summarization cutoff index, adjust it when you trim. - Keep the state schema lean. Ours holds
page_context,timezone,platform— and deliberately not JSON schemas or tool definitions, which we moved to on-demand DB lookups to keep the blob small. - Cancel vs. regenerate are different operations. Cancel = mark a flag (don’t destroy evidence). Regenerate = physically delete that trace and everything after it, then trim the checkpoint by the same count. Get this wrong and resumed runs replay stale turns.
A checkpoint is a stack frame, not an archive. Decide which layer owns durable history, give the other an eviction policy, and rebuild context from the durable side every turn.
3. Control flow: design explicit exits, and make interrupt/resume a first-class mechanism
Loops. Our nastiest hang was an agent that retried forever until LangGraph killed it with GraphRecursionError. The root cause wasn’t the recursion limit being too low — it was a tool that failed silently: given empty input it returned "Error: preferences is empty", the model didn’t know what to do differently, so it retried identically. The fix wasn’t a bigger limit; it was giving the model a usable exit:
async def save_preferences(preferences: dict | None = None,
description: str | None = None) -> str:
if not preferences and description:
preferences = {"description": description} # accept a fallback shape
if not preferences:
# actionable failure → the model can change strategy, not just retry
return "Error: pass either a `preferences` dict or a `description` string."
...
Recursion limits are a backstop, not a control-flow strategy. Any node that can route back to itself needs a designed exit: a budget, a fallback, and failures phrased as instructions.
Pauses. The other half of control flow is stopping cleanly and coming back. We needed this because access tokens expire mid-run — a long inspection would 401 halfway through. Instead of dying, a middleware turns expiry into a pause/resume cycle on top of the checkpointer:
class TokenGuard(AgentMiddleware):
async def awrap_tool_call(self, request, handler):
try:
return await handler(request)
except TokenExpiredError:
interrupt({"type": "token_expired"}) # checkpoint + pause
return await handler(request) # resume → token is fresh
The reusable insight: interrupt() + a durable checkpointer is a general “pause this run and come back, possibly on another pod” primitive — not just a token-refresh trick. The same shape covers human-in-the-loop approval, long-task confirmation, and resource backpressure. The interrupt payload just carries a type the frontend dispatches on.
Two traps worth knowing: frameworks love to wrap tool execution in except Exception, which will swallow the GraphInterrupt you just raised (audit for it — we had to let GraphBubbleUp re-raise). And persist the pending-resume state before you emit the interrupt event, or a fast client resumes against data that isn’t there yet.
Don’t tune the recursion limit; design the exit. And treat interrupt/resume as core infrastructure — it’s what makes everything else recoverable.
4. Scheduled & background work: isolate every run, and mind identity
Letting users schedule recurring agent runs (“inspect my devices every morning”) looks like a small feature and is actually where the subtle bugs live, because background execution loses all the request-scoped context you took for granted.
A scheduler loop persists jobs (cron/daily/weekly/interval) to the database and wakes periodically to run what’s due. The patterns that mattered:
- Each run is fully isolated. Fresh
session_id, fresh trace, its own execution context — never reuse the originating chat session, or scheduled output contaminates the user’s live history. - Catch up on restart. On startup, reload active jobs and immediately fire anything already overdue. A pod restart should never silently skip a run.
- Prepare every context variable the run depends on, up front. This is the part that bites. In a live request, identity, tenant, locale, and auth tokens all arrive on the request and sit in context for free. A scheduled run has none of that — the token that authorized “create this job” is long expired by the time it fires. Before the agent executes, you have to reconstruct the context it will read from: capture whatever identity and scoping the job needs at creation time, persist it on the job record, and re-establish it (plus a freshly minted service token) when the job runs. Auditing exactly which variables your tools read from context — and pinning each one into the job — is the single highest-leverage thing you can do here.
- Always advance
next_run_atand recordlast_error, even on failure. Partial failure (the agent succeeded but the notification didn’t) should be logged, not fatal. A job that doesn’t reschedule itself on error is a job that runs once.
Background work strips away request context. Make every run idempotent enough to retry, and treat “rebuild the context the agent expects” as an explicit step — the failure mode isn’t a crash, it’s a job that runs with the wrong scope.
5. RAG: fit the retrieval to the domain, expose it as a tool, and let evals drive it
We started with the default — embeddings into a vector DB — and eventually moved the main knowledge base to jieba + BM25 with a custom dictionary. For a vertical corpus full of exact terminology (device models, protocol names, config keys), keyword precision beats semantic similarity, costs no embedding round-trips, and is debuggable (you can see the score). A domain dictionary keeps ER805 and VLAN as single tokens instead of being split into noise. We still use vector search where it fits — semantic lookups over structured config schemas — so the lesson isn’t “BM25 wins,” it’s match the retrieval method to the corpus, and don’t assume vector search by default.
Three patterns carried the quality:
- RAG is a tool the agent chooses to call, not a fixed pipeline step. The model decides whether a question needs documents at all (“what time is it” doesn’t), which saves tokens and latency. Retrieval, query rewriting, and evaluation are separate concerns — rewriting and retrieval are callable tools; evaluation runs async afterward.
- Query rewriting is terminology translation, not question improvement. Its job is to turn user phrasing into corpus phrasing — “guest network” → “VLAN”, “can’t get online” → “WAN / cellular / DNS / route”. Then fuse multiple rewrites with reciprocal-rank fusion so different phrasings of the same need all hit. Prioritize matching the docs’ own headings over creative diversity.
- For structured data, constrain the query instead of embedding everything. A config schema can be 10k+ tokens. A tool that takes a path (
system.ntp) and returns just that fragment keeps the prompt small and the answer grounded — a far better pattern than dumping the whole schema into context.
Finally, let metrics drive iteration, not vibes. We score a sample of traffic with RAGAS (faithfulness, answer relevancy, context precision) and write the scores back onto the trace. Low faithfulness → retrieval is weak or k is too high; low relevancy → rewriting needs work; low precision → chunking or k needs tuning. Without the feedback loop, every RAG change is a guess.
Pick retrieval that fits your corpus, expose it as a tool the agent can skip, translate queries into the domain’s own words, and close the loop with automated evals.
The through-line
If there’s one theme across all five, it’s that the architecture co-evolves with model capability. None of these patterns are timeless. Early on, weaker models forced heavy scaffolding — long, demanding prompts that spelled out every step, rigid pipelines, defensive parsing. The single-deep-agent pattern in section 1 wasn’t even an option a year ago; “deep agents” as a concept, and models reliable enough to drive their own tool loop, arrived during this project. A lot of our best work was deleting scaffolding the model had outgrown.
So treat these as decisions to revisit, not commandments. The durable part isn’t any specific topology — it’s the discipline underneath: decide who owns state, design explicit exits and recovery, isolate background work, fit retrieval to your data, and instrument enough that metrics tell you what to change next. The framework gives you genuinely good primitives — checkpointing, interrupt(), streaming, skills — and the models keep getting better. What stays your job is drawing the boundaries, and noticing when a model got good enough to let you tear a boundary out.