LangGraph: Build Production-Grade AI Agents That Don't Fall Apart

If you've built an agent with a linear LangChain chain or a simple loop around an LLM, you've probably hit the same wall: the agent works in the demo, then breaks in production the moment it hits an unexpected API response, a mid-workflow restart, or a task that needs human sign-off before continuing.

LangGraph solves this by treating your agent as a directed graph, not a chain. Nodes are functions. Edges are routing decisions. State is explicit, persistent, and resumable. As of mid-2026, it gets 34.5 million monthly downloads and runs in production at companies like Uber, LinkedIn, and JPMorgan.

This post covers the core model, shows you working Python code, and flags the pitfalls that bite engineers early on.

Why Chains Break in Production

A chain is a linear sequence: input → step A → step B → step C → output. That works fine for a single-shot summarizer. It falls apart when:

Step B fails and you need to retry only that step without rerunning A
The agent needs to branch: "if the tool returned an error, try a different tool"
A server restart mid-execution loses all progress
A human needs to approve an action before step C runs

Chains have no concept of state between steps, no native branching, and no persistence. You end up bolting these on as custom wrappers until your "simple chain" is a pile of duct tape.

LangGraph's answer: model the agent explicitly as a stateful graph from the start.

The Core Model: State, Nodes, Edges

Every LangGraph agent has three primitives.

State is a typed Python dict that flows through the entire graph. Every node reads from it and writes to it.

Nodes are Python functions that take the current state and return a dict of updates. A node returns only the keys it changed, and LangGraph merges them into the state.

Edges connect nodes. They can be fixed (A → B) or conditional (a function decides which node to go to next based on the current state).

Here's the simplest possible example — a research agent that calls a search tool and then generates a summary:

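from typing import TypedDict

from langgraph.graph import StateGraph, END

# my_search_tool and llm are placeholders for your own search function and chat model.

# 1. Define the state schema
class AgentState(TypedDict):
    query: str
    search_results: list[str]
    summary: str
    error: str | None

# 2. Define nodes
def search_node(state: AgentState) -> dict:
    """Call a search tool and store results, recording any failure in state."""
    try:
        results = my_search_tool(state["query"])  # your actual tool call
    except Exception as exc:
        # Record the failure so should_continue can route to handle_error
        return {"search_results": [], "error": str(exc)}
    return {"search_results": results, "error": None}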

def summarize_node(state: AgentState) -> dict:
    """Summarize the search results."""
    prompt = f"Summarize these results: {state['search_results']}"
    summary = llm.invoke(prompt).content
    return {"summary": summary}

def handle_error_node(state: AgentState) -> dict:
    """Fallback when something goes wrong."""
    return {"summary": f"Could not complete research: {state['error']}"}

# 3. Routing logic
def should_continue(state: AgentState) -> str:
    if state.get("error"):
        return "handle_error"
    return "summarize"

# 4. Build the graph
builder = StateGraph(AgentState)

builder.add_node("search", search_node)
builder.add_node("summarize", summarize_node)
builder.add_node("handle_error", handle_error_node)

builder.set_entry_point("search")

# Conditional edge after search
builder.add_conditional_edges(
    "search",
    should_continue,
    {
        "summarize": "summarize",
        "handle_error": "handle_error",
    }
)

builder.add_edge("summarize", END)
builder.add_edge("handle_error", END)

graph = builder.compile()

# 5. Run it
result = graph.invoke({"query": "LangGraph production patterns 2026"})
print(result["summary"])

The key difference from a chain: should_continue is a first-class part of the graph, not an if buried inside a step. You can visualize the graph, test the routing function independently, and add more branches without touching the nodes themselves.

Checkpointing: Survive Restarts and Lambda Cold Starts

The feature that actually makes LangGraph production-ready is checkpointing. Every time a node completes, LangGraph saves the full state to a database. If your process dies mid-workflow, the agent resumes from the last completed node.

from langgraph.checkpoint.postgres import PostgresSaver

# Use Postgres as the checkpoint store
connection_string = "postgresql://user:pass@host/dbname"

# In current releases, from_conn_string is a context manager that owns the connection
with PostgresSaver.from_conn_string(connection_string) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables; needed the first time

    # Compile with the checkpointer
    graph = builder.compile(checkpointer=checkpointer)

    # Each run needs a thread_id: this is how LangGraph identifies a workflow instance
    config = {"configurable": {"thread_id": "research-job-42"}}

    result = graph.invoke(
        {"query": "vector database benchmarks 2026"},
        config=config
    )

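If the process crashes after search_node completes but before summarize_node, resume by calling graph.invoke(None, config=config) with the same thread_id. Passing None as the input tells LangGraph to continue from the last checkpoint rather than start a new run, so execution picks up at summarize_node. The search results are already in the saved state, so there are no duplicate API calls.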

For local development, swap PostgresSaver for MemorySaver from langgraph.checkpoint.memory. For serverless (Lambda, Cloud Run), use a managed store like Redis or Postgres, because the in-memory saver won't survive restarts.

Human-in-the-Loop with `interrupt()`

Some actions are irreversible: sending an email, writing to a database, charging a customer. You want the agent to pause before those steps and wait for a human to approve.

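LangGraph's interrupt() function pauses execution mid-graph and waits for the calling code to resume it with a value, which becomes the return value of interrupt(). It needs a checkpointer to store the paused state. When the graph resumes, the node runs again from its first line, so any code above the interrupt() call executes twice.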

from langgraph.types import interrupt

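def send_email_node(state: AgentState) -> dict:
    # Assumes AgentState also declares recipient, email_sent, and rejection_reason.
    # This line runs again on resume, so generate_email_draft must return the
    # same draft both times (or draft in an earlier node and read it from state).
    draft = generate_email_draft(state)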
    
    # Pause here and surface the draft to the human
    human_decision = interrupt({
        "action": "send_email",
        "draft": draft,
        "recipient": state["recipient"]
    })
    
    if human_decision["approved"]:
        send_email(draft, state["recipient"])
        return {"email_sent": True}
    else:
        return {"email_sent": False, "rejection_reason": human_decision.get("reason")}

On the calling side, invoke() returns when the graph pauses; it does not raise. In recent LangGraph releases the pending interrupt is reported under the "__interrupt__" key of the result. Show the draft to the user, then resume with their decision:

from langgraph.types import Command

config = {"configurable": {"thread_id": "email-workflow-7"}}

result = graph.invoke(initial_state, config=config)

# A paused run reports its pending interrupts under "__interrupt__"
if "__interrupt__" in result:
    # .value is the dict you passed to interrupt()
    draft_info = result["__interrupt__"][0].value

    # Show the draft to the user (in your UI, Slack, wherever)
    user_approved = ask_human_for_approval(draft_info)

    # Resume the graph with the human's decision
    result = graph.invoke(
        Command(resume={"approved": user_approved}),
        config=config
    )

This pattern is useful beyond email: database schema migrations, financial transactions, customer-facing messages, infrastructure changes — anything where "oops" is expensive.

Patterns and Tradeoffs

Pattern	When to use	Tradeoff
Sequential graph	Simple, linear workflows	No benefit over a chain; adds boilerplate
Conditional routing	Different paths based on tool output	Routing functions must be deterministic
Retry loop	Node calls a flaky external API	Need a max-retries guard to avoid infinite loops
Human-in-the-loop	Irreversible or high-stakes actions	Workflow hangs until resumed; needs timeout handling
Parallel nodes	Independent tool calls (e.g. search + DB lookup)	State merging can conflict; use `Annotated` reducer
Subgraphs	Large agents with distinct sub-workflows	Good for team-level separation; adds compile complexity

Common Pitfalls

Not typing your state. An untyped state dict is hard to debug. When a node returns a key that nothing reads, or overwrites a key another node expected, you'll spend an hour in logs. Use TypedDict strictly from the start.

Missing a max-retries guard in retry loops. A conditional edge that sends the agent back to a failing node will keep looping, burning API calls, until LangGraph's recursion limit stops the run with a GraphRecursionError. Always track a retry count in state and route to an error node after N attempts.

Using MemorySaver in serverless. It's documented as "for testing." In a Lambda function, the memory doesn't persist between invocations. Use Postgres or Redis.

One thread_id for all users. Each workflow instance needs its own thread_id. If you reuse the same ID, the second run starts from the state saved by the first, so one user's data can leak into another's workflow.
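One way to avoid this is to build the config from a helper that mints a fresh ID per run. The helper name and the ID format below are illustrative assumptions; only the {"configurable": {"thread_id": ...}} shape comes from the examples above.

```python
import uuid


def new_thread_config(user_id: str, workflow: str) -> dict:
    """Build a config whose thread_id is unique to this workflow run."""
    thread_id = f"{workflow}:{user_id}:{uuid.uuid4().hex}"
    return {"configurable": {"thread_id": thread_id}}


config_a = new_thread_config("user-1", "research")
config_b = new_thread_config("user-1", "research")
print(config_a["configurable"]["thread_id"])
```

Store the generated thread_id alongside the job record: you need the same value later to resume after a crash or an interrupt.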

Not visualizing the graph during development. LangGraph can render a Mermaid diagram of your compiled graph. Run print(graph.get_graph().draw_mermaid()) early and often — it's much easier to catch a wrong edge in a diagram than in a stack trace.

Putting business logic in routing functions. Routing functions should only read state and return a string. Any logic that modifies state or calls external services belongs in a node. Mixing the two makes the graph non-deterministic and hard to test.

When LangGraph Is (and Isn't) the Right Tool

LangGraph adds real complexity. It's the right choice when:

Your workflow has conditional branches or retry loops
You need crash recovery or long-running multi-step execution
You need human approval gates
The workflow involves multiple tools that can fail independently

It's overkill when:

You're doing a single LLM call with a fixed prompt
Your "chain" is genuinely linear with no failure modes
You're prototyping and speed matters more than reliability

For simple use cases, a direct SDK call or a basic LangChain RunnableSequence is faster to write and easier to read. Graduate to LangGraph when the complexity of your error handling starts to exceed the complexity of the actual task.

Next Steps

The official LangGraph docs at docs.langchain.com/oss/python/langgraph/overview are the best place to start. The langgraph-checkpoint-postgres and langgraph-checkpoint-redis packages are both production-tested. For observability, LangSmith integrates natively and shows per-node traces without any instrumentation code.

If you want to go deeper on production agent patterns — memory, evals, multi-agent orchestration, and deploying agents behind an API — that's exactly what we cover in the Joinloop AI Engineering Cohort. The cohort is hands-on: you build a real agent system, break it, fix it, and ship it.