Jun 5, 2026

5 Things I Learned Running LangGraph Multi-Agent in Production for a Year

I spent the past year building and operating a LangGraph multi-agent system in production: an AI assistant for a B2B device-management platform. It went from a single agent to a multi-agent setup and then through a full rewrite, and a lot of what I “learned” was really just finding out which design decisions survive past week 40.

This isn’t a war-stories post. It’s the five design patterns I’d reach for on day one of the next agent app, plus the reasoning that earned each one: orchestration, state and history, control flow, scheduled work, and RAG.

1. Orchestration: start with one agent and skills, not a fleet of sub-agents

The textbook pattern is a supervisor routing to specialized child agents. We built exactly that, with separate agents for diagnostics, config, logs, and inspection, and it scaled badly. Every supervisor-to-child-to-supervisor hop re-serializes the context into the next LLM call. Child agents sharing one state dict produced duplicated history and ordering bugs. And “who owns this request” turned into a routing problem of its own.

We collapsed it into a single deep agent that calls tools directly and loads skills: folders of markdown (instructions plus example commands) that get discovered at runtime.

agent = create_deep_agent(
    model=get_llm(),
    tools=[save_memory, schedule_task, ...],   # a few cross-cutting tools
    subagents=[],                              # ← no child agents
    skills=["/skills"],                        # ← capabilities = markdown files
    middleware=[GreetingFastPath(), InjectContext(), TokenGuard()],
)

Two things made this a durable win rather than just a refactor:

Capabilities became data, not code. Adding a skill means writing a SKILL.md file, not shipping a new agent class. The unit of extension matches the unit of change.
Skill granularity follows the user’s mental model. One skill is one complete task the model can finish in a single reasoning pass, like “diagnose an offline device” with the full command sequence inside. We learned the hard way that splitting one workflow into many micro-skills just multiplies tool-call round-trips.

The middleware stack is worth a look too: GreetingFastPath skips the LLM entirely for “hi”, InjectContext adds the timezone, tenant, and memories, and TokenGuard does the thing I’ll get to below. Middleware is for cross-cutting concerns; the business logic stays in the skills and prompts. A middleware that starts encoding “if the device is offline, run diagnostics” is one that will fight you in three weeks.

Reach for a sub-agent only when you actually need parallelism or hard context isolation. Until then, one agent with tools and skills is flatter, cheaper, and leaves you with a single state to reason about. Multi-agent is something you add when a flat design stops working, not a starting point.

2. State and history: separate execution state from durable history

This is the most important data decision in the whole system, and the easiest to get wrong. Early on we let LangGraph’s checkpointer and our own app-level store both believe they owned the conversation. The result was history that doubled on every retry, and a checkpoint blob that eventually blew past MongoDB’s 16MB document limit.

The fix is a clean ownership split:

Durable history (what the user sees, what you audit) lives in the application layer, modeled after Langfuse. A Session holds metadata and an ordered list of trace IDs, with no message bodies, and each Trace holds one request: its input, output, and intermediate tool calls. It’s permanent, queryable, and filtered by user and tenant on every read.
Execution state (the checkpointer) holds only what’s needed to resume an in-flight run. It’s allowed to forget, and it defends itself with a trim:

def trim_to_fit(messages, blob_size):
    if blob_size <= LIMIT:
        return messages
    keep = max(MIN_KEEP, int(len(messages) * LIMIT / blob_size * 0.8))
    while serialized(messages[-keep:]) > LIMIT and keep > MIN_KEEP:
        keep //= 2
    return sanitize_boundaries(messages[-keep:])  # drop orphaned tool messages

The pieces that make this work in practice:

Rebuild context per request from durable history, not from whatever the checkpointer happens to still hold. Once you do that, the trimming is invisible to the user.
Design state for deletion from the start. Trimming will leave orphaned ToolMessages with no matching AIMessage, so sanitize the boundaries or the next LLM call errors out. If you keep a summarization cutoff index, adjust it when you trim.
Keep the state schema lean. Ours holds page_context, timezone, and platform, and deliberately not JSON schemas or tool definitions; we moved those to on-demand DB lookups to keep the blob small.
Cancel and regenerate are different operations. Cancel just marks a flag, so you don’t destroy evidence. Regenerate physically deletes that trace and everything after it, then trims the checkpoint by the same count. Get this wrong and resumed runs replay stale turns.

A checkpoint is a stack frame, not an archive. Decide which layer owns durable history, give the other an eviction policy, and rebuild context from the durable side every turn.

3. Control flow: design explicit exits, and make interrupt/resume a first-class mechanism

Loops. Our nastiest hang was an agent that retried forever until LangGraph killed it with GraphRecursionError. The root cause wasn’t that the recursion limit was too low. It was a tool that failed silently: given empty input it returned "Error: preferences is empty", the model couldn’t tell what to do differently, so it retried the exact same call. The fix wasn’t a bigger limit. It was giving the model a usable exit:

async def save_preferences(preferences: dict | None = None,
                           description: str | None = None) -> str:
    if not preferences and description:
        preferences = {"description": description}   # accept a fallback shape
    if not preferences:
        # actionable failure: the model can change strategy, not just retry
        return "Error: pass either a `preferences` dict or a `description` string."
    ...

Recursion limits are a backstop, not a control-flow strategy. Any node that can route back to itself needs a designed exit: a budget, a fallback, and failure messages phrased as instructions the model can actually act on.

Pauses. The other half of control flow is stopping cleanly and coming back. We needed this because access tokens expire mid-run: a long inspection would 401 halfway through. Instead of dying, a middleware turns the expiry into a pause/resume cycle on top of the checkpointer:

class TokenGuard(AgentMiddleware):
    async def awrap_tool_call(self, request, handler):
        try:
            return await handler(request)
        except TokenExpiredError:
            interrupt({"type": "token_expired"})   # checkpoint + pause
            return await handler(request)           # resume: token is fresh now

The reusable insight is that interrupt() plus a durable checkpointer gives you a general “pause this run and pick it up later, maybe on another pod” primitive, not just a token-refresh trick. The same shape covers human-in-the-loop approval, long-task confirmation, and backpressure when you’re out of some resource. The interrupt payload just carries a type the frontend dispatches on.

Two traps worth knowing. Frameworks love to wrap tool execution in except Exception, which will swallow the GraphInterrupt you just raised, so we had to special-case GraphBubbleUp to let it re-raise. And you have to persist the pending-resume state before you emit the interrupt event, or a fast client will resume against data that isn’t there yet.

Don’t tune the recursion limit; design the exit. And treat interrupt/resume as core infrastructure, because it’s what makes everything else recoverable.

4. Scheduled and background work: isolate every run, and mind identity

Letting users schedule recurring agent runs (“inspect my devices every morning”) looks like a small feature, and it’s actually where the subtle bugs live, because background execution loses all the request-scoped context you took for granted.

A scheduler loop persists jobs (cron/daily/weekly/interval) to the database and wakes periodically to run what’s due. The patterns that mattered:

Each run is fully isolated. Give it a fresh session_id, a fresh trace, and its own execution context. Never reuse the originating chat session, or scheduled output contaminates the user’s live history.
Catch up on restart. On startup, reload the active jobs and immediately fire anything already overdue. A pod restart should never silently skip a run.
Prepare every context variable the run depends on, up front. This is the part that bites. In a live request, identity, tenant, locale, and auth tokens all arrive on the request and sit in context for free. A scheduled run has none of that: the token that authorized “create this job” expired long before it fires. Before the agent runs, you have to reconstruct the context it will read from. Capture whatever identity and scoping the job needs at creation time, persist it on the job record, and re-establish it (plus a freshly minted service token) when the job runs. Working out exactly which variables your tools read from context, and pinning each one onto the job, is the highest-leverage thing you can do here.
Always advance next_run_at and record last_error, even on failure. A partial failure, like the agent succeeding but the notification not going out, should be logged, not fatal. A job that doesn’t reschedule itself on error is a job that runs once.

Background work strips away request context. Make every run idempotent enough to retry, and treat “rebuild the context the agent expects” as an explicit step. The failure mode here isn’t a crash; it’s a job that quietly runs with the wrong scope.

5. RAG: fit the retrieval to the domain, expose it as a tool, and let evals drive it

We started with the default, embeddings into a vector DB, and eventually moved the main knowledge base to jieba plus BM25 with a custom dictionary. For a vertical corpus full of exact terminology (device models, protocol names, config keys), keyword precision beats semantic similarity, costs no embedding round-trips, and lets you actually see the score when something ranks wrong. A domain dictionary keeps ER805 and VLAN as single tokens instead of splitting them into noise. We still use vector search where it fits, like semantic lookups over structured config schemas, so the lesson isn’t “BM25 wins.” It’s that you should match the retrieval method to the corpus instead of reaching for vector search by default.

Three patterns carried the quality:

RAG is a tool the agent chooses to call, not a fixed pipeline step. The model decides whether a question needs documents at all (“what time is it” doesn’t), which saves tokens and latency. Keep retrieval, query rewriting, and evaluation as separate concerns: rewriting and retrieval are callable tools, and evaluation runs async afterward.
Query rewriting is terminology translation, not question improvement. Its job is to turn user phrasing into the corpus’s phrasing, so “guest network” becomes “VLAN” and “can’t get online” becomes some mix of “WAN”, “cellular”, “DNS”, and “route”. Then fuse the rewrites with reciprocal-rank fusion so different phrasings of the same need all hit. Match the docs’ own headings rather than chasing variety.
For structured data, constrain the query instead of embedding everything. A config schema can run to 10k+ tokens. A tool that takes a path like system.ntp and returns just that fragment keeps the prompt small and the answer grounded, which beats dumping the whole schema into context.

Finally, let the metrics drive iteration instead of your gut. We score a sample of traffic with RAGAS (faithfulness, answer relevancy, context precision) and write the scores back onto the trace. Low faithfulness usually means retrieval is weak or k is too high; low relevancy means the rewriting needs work; low precision points at chunking or k. Without that loop, every RAG change is a guess.

Pick retrieval that fits your corpus, expose it as a tool the agent can skip, translate queries into the domain’s own words, and close the loop with automated evals.

The through-line

If there’s one theme across all five, it’s that the architecture co-evolves with the model’s capability. None of these patterns are timeless. Early on, weaker models forced a lot of scaffolding: long prompts that spelled out every step, rigid pipelines, defensive parsing everywhere. The single-deep-agent pattern from section 1 wasn’t even an option a year ago. “Deep agents” as a concept, and models reliable enough to drive their own tool loop, both showed up during this project. A lot of our best work was deleting scaffolding the model had outgrown.

So treat these as decisions to revisit, not commandments. The durable part isn’t any specific topology. It’s the discipline underneath: decide who owns state, design your exits and recovery, isolate background work, fit retrieval to your data, and instrument enough that the metrics can tell you what to change next. The framework hands you genuinely good primitives (checkpointing, interrupt(), streaming, skills) and the models keep getting better. What stays your job is drawing the boundaries, and noticing when a model has gotten good enough that you can tear one back out.