LLM Evals for Production Agents: A Practical Guide
89% of teams running AI agents in production have observability. Only 52% have evals. That 37-point gap is where production quality dies — here's how to close it.
LangChain's State of Agent Engineering survey surfaced a number that should bother every engineering team: 89% of production agent teams have observability, but only 52% have evals. That 37-point gap is not a tooling gap. It's a mindset gap — teams treat evals as a nice-to-have test suite rather than the primary signal for whether their agent actually works.
Logs tell you what happened. Evals tell you whether it was correct.
This guide covers how to build a practical eval suite for a production LLM agent — from unit-level checks to full trace evaluation — with runnable Python examples.
Why observability alone isn't enough
Traces and logs answer operational questions: latency, token usage, error rates, retry counts. They're essential for debugging individual runs. But they can't answer the question that matters: did the agent do the right thing?
An agent that takes 4 tool calls instead of 2 to answer a question might look fine in your latency dashboards. An agent that hallucinates a policy or calls the wrong tool with plausible-looking arguments will log clean traces. Observability catches infrastructure failures. Evals catch correctness failures.
The other reason evals matter more now: agent actions are consequential. When your agent can draft an email, write to a database, or place an order, a wrong answer isn't just a bad UX — it's an action that happened in the real world.
The three levels of agent evals
Think of agent evals in three layers, each answering a different question:
| Level | What it tests | Frequency |
|---|---|---|
| Unit evals | Individual components: retrieval quality, prompt output, tool-call correctness | Every commit |
| Trace evals | Full runs: did the agent use the right tools, in the right order, with correct inputs? | Every commit + daily on prod samples |
| End-to-end evals | Final output quality against golden answers or human judgment | Before deploys + weekly sampling |
Don't skip straight to end-to-end. Unit evals are cheap and fast, and they catch many regressions before they ever reach production.
Level 1: Unit evals
Tool-call correctness
The most common agent failure is not "wrong answer" — it's "called the wrong tool" or "called the right tool with wrong arguments." Write explicit tests for this.
import pytest
from unittest.mock import patch

from your_agent import run_agent  # your agent entry point

TOOL_CALL_CASES = [
    {
        "input": "What's the refund policy for order #4821?",
        "expected_tool": "lookup_order",
        "expected_args_subset": {"order_id": "4821"},
    },
    {
        "input": "Cancel order #4821",
        "expected_tool": "cancel_order",
        "expected_args_subset": {"order_id": "4821"},
    },
    {
        "input": "What are your store hours?",
        "expected_tool": "search_faq",
        "expected_args_subset": {},
    },
]


@pytest.mark.parametrize("case", TOOL_CALL_CASES)
def test_tool_selection(case):
    captured_calls = []

    # Record each tool call instead of executing the real tool.
    def mock_tool_executor(tool_name, args):
        captured_calls.append({"tool": tool_name, "args": args})
        return {"result": "mock_response"}

    with patch("your_agent.execute_tool", side_effect=mock_tool_executor):
        run_agent(case["input"])

    assert len(captured_calls) > 0, "Agent made no tool calls"
    first_call = captured_calls[0]
    assert first_call["tool"] == case["expected_tool"], (
        f"Expected tool '{case['expected_tool']}', got '{first_call['tool']}'"
    )
    # Only the listed arguments must match; extra arguments are allowed.
    for key, val in case["expected_args_subset"].items():
        assert first_call["args"].get(key) == val
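Run this on every commit. Mocking tool execution keeps the test free of side effects and slow downstream calls, but the agent still calls the model to choose a tool, so budget for that latency and cost. In exchange, tool-routing regressions surface immediately.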
Retrieval quality (for RAG-based agents)
If your agent retrieves context before generating a response, test retrieval separately from generation. Two failure modes are common: retrieving irrelevant chunks, and failing to retrieve the relevant one at all.
from your_retriever import retrieve  # your vector search function

RETRIEVAL_CASES = [
    {
        "query": "What is the maximum refund window?",
        "must_contain": "30-day return policy",  # substring that must appear in top-k results
        "top_k": 3,
    },
    {
        "query": "How do I escalate a complaint?",
        "must_contain": "escalation form",
        "top_k": 3,
    },
]


def test_retrieval_recall():
    for case in RETRIEVAL_CASES:
        results = retrieve(case["query"], top_k=case["top_k"])
        combined_text = " ".join(r["text"] for r in results).lower()
        assert case["must_contain"].lower() in combined_text, (
            f"Query '{case['query']}' failed to retrieve expected content"
        )
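This is a recall test, not a precision test. It checks that the right content is somewhere in the retrieved set. It says nothing about how many irrelevant chunks came back or where the relevant one ranked; measuring that requires rank-aware metrics and labeled relevance judgments.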
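To go one step beyond recall, a rank-aware metric such as mean reciprocal rank (MRR) rewards the retriever for placing the expected chunk near the top. The sketch below is a minimal illustration rather than part of the suite above: `reciprocal_rank` and `mean_reciprocal_rank` are hypothetical helper names, and it assumes results are plain strings ordered best-first (adapt the lookup if your retriever returns dicts with a `text` key) and reuses the substring-matching convention of the recall test.

```python
def reciprocal_rank(results: list[str], must_contain: str) -> float:
    """Return 1/rank of the first result containing the substring, or 0.0 if none does."""
    needle = must_contain.lower()
    for rank, text in enumerate(results, start=1):
        if needle in text.lower():
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(cases: list[tuple[list[str], str]]) -> float:
    """Average reciprocal rank over (ranked_results, must_contain) pairs."""
    if not cases:
        return 0.0
    return sum(reciprocal_rank(r, m) for r, m in cases) / len(cases)


# Toy data standing in for retriever output (ordered best-first).
cases = [
    (["Shipping takes 3-5 days.", "We offer a 30-day return policy."], "30-day return policy"),
    (["Use the escalation form to file a complaint."], "escalation form"),
]
print(mean_reciprocal_rank(cases))  # (1/2 + 1/1) / 2 = 0.75
```

Tracking MRR over time shows whether a change pushed the right chunk further down the list even while the recall test keeps passing.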
Level 2: Trace evals
Full traces capture the complete sequence: input → LLM call → tool calls → more LLM calls → output. Evaluating traces lets you catch multi-step reasoning failures that unit tests miss.
Structuring your traces
Every agent run should produce a structured trace. Here's a minimal schema:
from dataclasses import dataclass, field
from typing import Any
import time
import uuid


@dataclass
class ToolCall:
    tool_name: str
    args: dict
    result: Any
    duration_ms: float


@dataclass
class AgentTrace:
    run_id: str
    input: str
    output: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    total_tokens: int = 0
    duration_ms: float = 0.0
    error: str | None = None  # "X | None" syntax requires Python 3.10+


# In your agent, collect this during execution
def run_agent_with_tracing(user_input: str) -> AgentTrace:
    # uuid4 avoids run_id collisions between runs that start at the same timestamp
    trace = AgentTrace(run_id=str(uuid.uuid4()), input=user_input, output="")
    start = time.perf_counter()  # monotonic clock, suited to measuring durations
    # ... your agent logic, appending to trace.tool_calls as you go ...
    trace.duration_ms = (time.perf_counter() - start) * 1000
    return trace
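One way to populate `tool_calls` without scattering timing code through the agent is to route every tool invocation through a small wrapper. The sketch below rests on an assumption about how your agent dispatches tools (a dict mapping tool names to callables); `call_tool_traced` and `TOOLS` are hypothetical names, and the `ToolCall` dataclass is repeated only so the block runs on its own.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolCall:  # same fields as the schema above, repeated for self-containment
    tool_name: str
    args: dict
    result: Any
    duration_ms: float


def call_tool_traced(
    tools: dict[str, Callable[..., Any]],
    tool_name: str,
    args: dict,
    tool_calls: list[ToolCall],
) -> Any:
    """Run one tool, time it, and append a ToolCall record even if the tool raises."""
    start = time.perf_counter()
    result: Any = None
    try:
        result = tools[tool_name](**args)
        return result
    except Exception as exc:
        result = {"error": repr(exc)}  # keep the failure visible in the trace
        raise
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        tool_calls.append(ToolCall(tool_name, args, result, elapsed_ms))


# Hypothetical tool registry for illustration only.
TOOLS = {"lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"}}

calls: list[ToolCall] = []
call_tool_traced(TOOLS, "lookup_order", {"order_id": "4821"}, calls)
print(calls[0].tool_name, calls[0].result)
```

Recording the call in `finally` means a failed or unknown tool still shows up in the trace, which is usually the run you most want to inspect.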
Trace-level assertions
Once you have traces, you can write assertions over them:
def evaluate_trace(trace: AgentTrace, expected: dict) -> dict[str, bool]:
    results = {}

    # Did it call the right tools?
    actual_tools = [tc.tool_name for tc in trace.tool_calls]
    results["correct_tools"] = actual_tools == expected.get("tool_sequence", actual_tools)

    # Did it stay within step budget?
    results["within_step_budget"] = len(trace.tool_calls) <= expected.get("max_steps", 10)

    # Did it produce output without errors?
    results["no_error"] = trace.error is None
    results["has_output"] = bool(trace.output.strip())

    return results
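Exact sequence equality can be too strict when an agent legitimately adds a harmless extra call, such as a second FAQ search. A looser check, sketched here as one option and not as part of the function above, verifies that the expected tools appear in order while tolerating extra calls in between. `is_ordered_subsequence` is a hypothetical helper name.

```python
def is_ordered_subsequence(expected: list[str], actual: list[str]) -> bool:
    """True if every expected tool appears in `actual` in the same relative order."""
    remaining = iter(actual)
    # `tool in remaining` consumes the iterator up to the match, which enforces ordering.
    return all(tool in remaining for tool in expected)


print(is_ordered_subsequence(
    ["lookup_order", "cancel_order"],
    ["lookup_order", "search_faq", "cancel_order"],
))  # True: the extra search_faq call is tolerated
print(is_ordered_subsequence(
    ["lookup_order", "cancel_order"],
    ["cancel_order", "lookup_order"],
))  # False: the tools ran in the wrong order
```

Pairing this check with the step budget keeps the looser match from hiding runaway loops.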
LLM-as-judge for output quality
For subjective quality (helpfulness, tone, accuracy), use a separate LLM call as an evaluator. This is the technique behind most automated quality evals today.
import json

import anthropic

client = anthropic.Anthropic()

JUDGE_PROMPT = """You are evaluating the output of a customer support AI agent.
User question: {question}
Agent response: {response}
Expected answer (ground truth): {ground_truth}
Score the response on:
1. Factual accuracy (0-3): Does it match the ground truth?
2. Completeness (0-2): Does it answer the full question?
3. Tone (0-2): Is it professional and helpful?
Return JSON only: {{"accuracy": int, "completeness": int, "tone": int, "reasoning": str}}"""


def llm_judge(question: str, response: str, ground_truth: str) -> dict:
    result = client.messages.create(
        model="claude-haiku-4-5-20251001",  # fast and cheap for batch evals
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question,
                response=response,
                ground_truth=ground_truth,
            ),
        }],
    )
    # Raises json.JSONDecodeError if the judge returns anything other than bare JSON.
    return json.loads(result.content[0].text)
Use a fast, cheap model for the judge (Haiku-class). Reserve your expensive frontier model for the agent itself. Calibrate your judge periodically by checking its scores against human labels on a sample of 20–30 cases.
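A calibration check along those lines can be as small as comparing judge totals with human totals for the same cases. The following is a minimal sketch: the score lists are made-up illustration data, `pearson` is a hand-rolled helper with a hypothetical name, and the 0.7 cutoff is a rule of thumb to tune, not a standard.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Illustration data only: total scores (0-7) for the same eight cases.
human_scores = [7, 6, 2, 5, 1, 7, 3, 4]
judge_scores = [6, 6, 3, 5, 2, 7, 2, 5]

r = pearson(human_scores, judge_scores)
exact_agreement = sum(h == j for h, j in zip(human_scores, judge_scores)) / len(human_scores)
print(f"pearson={r:.2f} exact_agreement={exact_agreement:.0%}")
if r < 0.7:  # rule-of-thumb cutoff, tune for your use case
    print("Judge diverges from human raters; revise the judge prompt.")
```

Reporting exact agreement alongside correlation helps because a judge can correlate well with humans while still being consistently one point too generous.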
Level 3: End-to-end evals with a golden dataset
A golden dataset is a curated set of 100–150 inputs with known-correct outputs. Build it by:
- Sampling 50 real production queries from the first week after your agent goes live
- Having a domain expert write correct answers for each
- Adding 50–100 adversarial/edge cases you want to guard against
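A golden dataset can live in a single JSON file under version control. The layout below is an assumption chosen to match a loader that reads `golden["cases"]` with `input` and `expected_output` keys; the `category` and `version` fields and the `validate_golden` helper are hypothetical additions for illustration, and the example answer is placeholder text, not real policy.

```python
import json

REQUIRED_KEYS = {"input", "expected_output", "category"}


def validate_golden(golden: dict) -> list[str]:
    """Return a list of problems found in a golden dataset; empty means it looks usable."""
    problems = []
    cases = golden.get("cases", [])
    if not cases:
        problems.append("no cases found")
    for i, case in enumerate(cases):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"case {i}: missing {sorted(missing)}")
        elif not case["expected_output"].strip():
            problems.append(f"case {i}: empty expected_output")
    return problems


# Placeholder content for illustration; real cases come from production samples.
golden = {
    "version": 1,
    "cases": [
        {
            "input": "What is the maximum refund window?",
            "expected_output": "Refunds are accepted within 30 days of delivery.",
            "category": "happy_path",
        },
        {"input": "Cancel my order", "expected_output": "", "category": "ambiguous"},
    ],
}

print(json.dumps(validate_golden(golden), indent=2))
```

Running a validator like this in CI catches half-written cases before they silently drag down the pass rate.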
Run end-to-end evals before every deploy and weekly on live traffic samples:
import json
from pathlib import Path


def run_eval_suite(golden_path: str) -> dict:
    golden = json.loads(Path(golden_path).read_text())
    scores = []
    for case in golden["cases"]:
        trace = run_agent_with_tracing(case["input"])
        judgment = llm_judge(
            question=case["input"],
            response=trace.output,
            ground_truth=case["expected_output"],
        )
        total = judgment["accuracy"] + judgment["completeness"] + judgment["tone"]
        scores.append({
            "input": case["input"],
            "score": total,
            "max_score": 7,  # 3 (accuracy) + 2 (completeness) + 2 (tone)
            "judgment": judgment,
        })
    # A case passes at 5 of 7 points or better.
    passing = sum(1 for s in scores if s["score"] >= 5)
    return {
        "total": len(scores),
        "passing": passing,
        # Guard against an empty dataset to avoid ZeroDivisionError.
        "pass_rate": passing / len(scores) if scores else 0.0,
        "scores": scores,
    }
Set a minimum pass rate threshold (e.g., 85%) and gate deploys on it.
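Gating a deploy usually means turning the report into a process exit code that CI can act on. This is a minimal sketch assuming the report dictionary carries the `pass_rate` and `total` keys returned by the suite above; `gate_exit_code` and the minimum case count are hypothetical choices, not a prescribed interface.

```python
def gate_exit_code(report: dict, min_pass_rate: float = 0.85, min_cases: int = 1) -> int:
    """Return 0 if the eval report clears the threshold, 1 otherwise (CI-style exit code)."""
    if report.get("total", 0) < min_cases:
        print("FAIL: eval suite ran too few cases to trust the result")
        return 1
    pass_rate = report["pass_rate"]
    if pass_rate < min_pass_rate:
        print(f"FAIL: pass rate {pass_rate:.1%} is below {min_pass_rate:.1%}")
        return 1
    print(f"OK: pass rate {pass_rate:.1%}")
    return 0


# Illustrative report; in CI you would pass the real suite output and call
# sys.exit(gate_exit_code(report)) so a failing gate blocks the deploy.
example_report = {"total": 100, "passing": 82, "pass_rate": 0.82}
print(gate_exit_code(example_report))
```

The minimum case count guards against a broken dataset path or an empty file producing a run that passes by evaluating nothing.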
Eval frameworks: which to use
| Framework | Best for | Hosted/Self-hosted | LLM-as-judge |
|---|---|---|---|
| DeepEval | Agent + RAG evals, pytest integration | Both | Yes |
| Ragas | RAG pipeline evals specifically | Both | Yes |
| LangSmith Evals | LangChain-based agents | Hosted | Yes |
| Braintrust | Dataset management + tracing | Hosted | Yes |
| Promptfoo | Prompt regression testing | Self-hosted | Yes |
| Custom (as above) | Full control, any stack | Self-hosted | Yes |
For most teams getting started: use DeepEval if you want a framework with batteries included, or roll a lightweight custom suite if you want full control and no extra dependencies.
Common pitfalls
1. Evaluating against your own model's outputs: If you use GPT-5 to generate your golden answers and GPT-5 as your judge, you're measuring self-consistency, not correctness. Use human-written golden answers, at least for your core eval set.
2. Letting the golden dataset go stale: Your product changes, your prompts change, your tools change. Review the golden dataset quarterly. Add cases when you fix a real production bug.
3. Only running evals on happy-path inputs: Real users ask ambiguous questions, ask out-of-scope questions, and try to break things. Your eval set needs failure modes and edge cases; aim for 30% non-happy-path.
4. Treating pass rate as the only signal: A 90% pass rate hiding consistent failure on one category is worse than a uniform 80% pass rate. Break down your scores by category (tool type, query type, user segment) and track per-category trends.
5. Skipping calibration of your LLM judge: An LLM judge that disagrees with human raters 40% of the time gives you false confidence. Sample 30 cases, have a human score them, and measure agreement. If Pearson correlation is below 0.7, your judge prompt needs work.
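The per-category breakdown from pitfall 4 needs little more than a group-by over scored cases. The sketch assumes each scored case carries a `category` label and a boolean `passed` flag; both field names and the sample data are hypothetical.

```python
from collections import defaultdict


def pass_rate_by_category(scored_cases: list[dict]) -> dict[str, float]:
    """Group scored cases by category and return the pass rate for each group."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for case in scored_cases:
        totals[case["category"]] += 1
        passes[case["category"]] += int(case["passed"])
    return {cat: passes[cat] / totals[cat] for cat in totals}


# Made-up results: a 90% overall pass rate that hides a weak category.
scored = (
    [{"category": "order_lookup", "passed": True}] * 8
    + [{"category": "refunds", "passed": True}] * 1
    + [{"category": "refunds", "passed": False}] * 1
)
overall = sum(c["passed"] for c in scored) / len(scored)
print(f"overall={overall:.0%}", pass_rate_by_category(scored))
# overall=90% {'order_lookup': 1.0, 'refunds': 0.5}
```

Here the headline number looks healthy while half of the refund cases fail, which is the pattern a single pass rate cannot show.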
A practical rollout plan
If you have zero evals today, don't try to build everything at once:
- Week 1: Write 20 tool-call unit tests for your most-used tools. Run them in CI.
- Week 2: Add trace collection to your agent. Store traces in a database or object store.
- Week 3: Build a starter 50-case golden dataset. Write an LLM judge. Get a baseline pass rate.
- Week 4: Gate staging deploys on your pass rate. Add one end-to-end eval run per week.
At this point you're ahead of the 48% of production agent teams that run no evals at all.
The bottom line
Observability tells you your agent ran. Evals tell you your agent worked. You need both, but only one of them tells you whether to ship.
Start with tool-call unit tests — they're the cheapest, fastest, and catch the most common failure mode. Layer in trace evals and a golden dataset as your system matures. Use LLM-as-judge for quality scoring, but calibrate it. And treat your eval dataset as a first-class engineering artifact: version it, review it, and grow it.
If you want to go deeper on building production-grade AI systems — including evals, agent architecture, and real-world deployment patterns — check out the Joinloop AI Engineering Cohort. We cover this end-to-end with hands-on projects.