
Demystifying Evals for AI Agents

Authors: Mikaela Grace, Jeremy Hadfield, Rodrigo Olivares, and Jiri De Jonghe (Anthropic)
Published: January 9, 2026
Source: Anthropic Engineering Blog


Chapter 1: Introduction

Evaluations help teams deploy AI agents with confidence by catching issues before they impact users. Without them, teams get trapped in reactive cycles, identifying problems only after production failures. Evaluations make behavioral changes and problems visible early, compounding in value throughout an agent's lifecycle.

As described in "Building effective agents," agents operate across multiple turns, calling tools, modifying state, and adapting based on results. The same capabilities that make agents useful -- autonomy, intelligence, flexibility -- also create evaluation challenges.

Through internal work and customer partnerships, effective evaluation design patterns have emerged for agents across different architectures and real-world deployments.


Chapter 2: The Structure of an Evaluation

An evaluation ("eval") tests an AI system by providing input and applying grading logic to measure output success. This chapter focuses on automated evals that run during development without real users.

Single-turn evaluations remain straightforward: a prompt, a response, and grading logic. Multi-turn evaluations have become increasingly common as AI capabilities have advanced.

Agent evaluations are more complex. Agents use tools across many turns, modifying environment state and adapting continuously -- mistakes propagate and compound. Frontier models also discover creative solutions that static evals fail to anticipate. For example, Opus 4.5 solved a flight-booking benchmark problem by finding a policy loophole, technically failing the eval while delivering superior user value.

Key Definitions

  • Task (problem/test case): A single test with defined inputs and success criteria.
  • Trial: One attempt at a task; because outputs vary between runs, multiple trials are needed to produce reliable results.
  • Grader: Logic that scores some aspect of agent performance; a task often contains multiple graders, each with its own assertions or checks.
  • Transcript (trace/trajectory): Complete trial record including outputs, tool calls, reasoning, intermediate results, and interactions. For the Anthropic API, the full messages array containing all API calls and responses.
  • Outcome: Final environment state after trial completion (e.g., a reservation existing in a database, not just confirmation text).
  • Evaluation harness: End-to-end infrastructure providing instructions and tools, running tasks concurrently, recording steps, grading outputs, and aggregating results.
  • Agent harness (scaffold): System enabling models to act as agents by processing inputs, orchestrating tool calls, and returning results. Evaluation assesses harness and model together.
  • Evaluation suite: A task collection measuring specific capabilities or behaviors with shared broad goals.

Chapter 3: Why Build Evaluations?

Teams can progress far initially through manual testing, dogfooding, and intuition, and more rigorous evaluation can seem to slow shipping. However, once a prototype scales into production, development without evals breaks down.

The Breaking Point

The breaking point comes when users report degraded performance and the team is left flying blind, with nothing but guesswork to go on. Debugging becomes reactive: complaints trigger reproduction attempts, bug fixes, and hope that nothing else regressed. Teams cannot:

  • Distinguish real regressions from noise
  • Test changes automatically against hundreds of scenarios pre-shipping
  • Measure improvements quantitatively

This progression appears repeatedly. Claude Code began with fast iteration driven by feedback from Anthropic employees and external users. Later, evals were added progressively -- initially for narrow areas like concision and file edits, then for complex behaviors like over-engineering. These evals identified issues, guided improvements, and focused research-product collaboration.

When to Start

Writing evals helps at any lifecycle stage. Early on, they force teams to specify success; later, they maintain consistent quality bars. Some teams create evals at development start; others add them at scale when evals become the improvement bottleneck. Evals particularly help early development by explicitly encoding expected behavior, preventing different engineers from diverging on edge-case handling. Regardless of timing, evals accelerate development.

Real-World Examples

  • Descript: Their video editing agent uses evals across three dimensions: avoiding breaks, following instructions, and quality execution. They evolved from manual grading to LLM graders with product-team criteria and periodic human calibration, running separate quality and regression suites.
  • Bolt: Their AI team added evals after achieving wide usage; within three months, they built systems grading with static analysis, browser agents, and LLM judges for instruction-following assessment.

Evals and Model Adoption

Evals shape how quickly teams adopt new models. When a more powerful model emerges, teams without evals face weeks of testing, while competitors with established suites can determine its strengths, tune prompts, and upgrade within days.

Once evals exist, baselines and regression tests come nearly free: latency, token usage, cost-per-task, and error rates can all be tracked on a static task bank. Evals also become the highest-bandwidth communication channel between product and research teams, defining the metrics to optimize. Their compounding value is often overlooked because the upfront costs are visible while the benefits accumulate later.


Chapter 4: Types of Graders for Agents

Agent evaluations combine three grader types: code-based, model-based, and human. Each evaluates some portion of the transcript or outcome. Effective design requires selecting the appropriate graders for each task.

Code-Based Graders

Methods:

  • String matching (exact, regex, fuzzy)
  • Binary tests (fail-to-pass, pass-to-pass)
  • Static analysis (lint, type, security)
  • Outcome verification
  • Tool call verification (tools used, parameters)
  • Transcript analysis (turns taken, token usage)

Strengths: Fast, cheap, objective, reproducible, easy to debug, verify specific conditions.

Weaknesses: Brittle against valid variations, lacking nuance, limited for subjective tasks.

Model-Based Graders

Methods:

  • Rubric-based scoring
  • Natural language assertions
  • Pairwise comparison
  • Reference-based evaluation
  • Multi-judge consensus

Strengths: Flexible, scalable, captures nuance, handles open-ended tasks, manages freeform output.

Weaknesses: Non-deterministic, expensive, requires human calibration.

Human Graders

Methods:

  • Subject matter expert (SME) review
  • Crowdsourced judgment
  • Spot-check sampling
  • A/B testing
  • Inter-annotator agreement

Strengths: Gold standard quality, matches expert judgment, calibrates model graders.

Weaknesses: Expensive, slow, often needs expert access at scale.

Combining Graders

Scoring combines graders through several schemes (see the sketch after this list):

  • Weighting: Grader scores combined with weights, with a threshold determining overall pass/fail
  • Binary: All graders must pass
  • Hybrid: Mixed approaches tailored to the use case
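
A minimal sketch of how such schemes might be expressed in a task config, in the same style as the examples later in this guide (all field names are illustrative, not tied to any specific framework):

scoring:
  mode: weighted            # or "binary" / "hybrid"
  pass_threshold: 0.8       # overall weighted score required to pass
  graders:
    - id: unit_tests
      weight: 0.5
    - id: llm_rubric
      weight: 0.3
    - id: static_analysis
      weight: 0.2

In a binary scheme every grader must pass; hybrid setups commonly make a few graders hard requirements and weight the rest.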

Chapter 5: Capability Versus Regression Evals

Capability Evals

Capability ("quality") evals ask what agents perform well on, starting with low pass rates and targeting agent struggles -- climbing hills. These evals push the frontier of what agents can do.

Regression Evals

Regression evals ask whether agents still handle previously-mastered tasks, maintaining near-100% pass rates. Declining scores signal breakage requiring fixes. As teams climb capability evals, regression evals prevent backsliding.

The Lifecycle

After launch and optimization, high-performing capability evals graduate to become continuous regression suites, measuring sustained reliability. The two types work together: capability evals drive improvement while regression evals ensure stability.
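
One way to make the distinction operational is to tag each suite by type and alert on regressions. A sketch with illustrative fields:

suites:
  - name: long-horizon-research
    type: capability          # hill to climb; low pass rates expected at first
    goal: raise pass rate over time
  - name: core-user-flows
    type: regression          # a graduated capability eval
    alert_below: 0.98         # investigate any dip below this pass rate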


Chapter 6: Evaluating Coding Agents

Coding agents write, test, and debug code, navigating codebases and running commands like developers. Effective evals use well-specified tasks, stable test environments, and thorough code testing.

Deterministic Grading

Deterministic grading suits coding agents, since software can usually be evaluated straightforwardly: does the code run, and do the tests pass?

Notable benchmarks:

  • SWE-bench Verified: Provides GitHub issues from popular Python repositories, grading by test-suite execution. Solutions pass only by fixing failing tests without breaking existing ones. LLM progress jumped from 40% to over 80% in one year.
  • Terminal-Bench: Tests end-to-end technical tasks like Linux kernel compilation or ML model training.

Beyond Pass-or-Fail

Beyond pass-or-fail outcome testing, grading transcripts proves useful. Code-quality heuristics evaluate generated code beyond test passage; model-based graders with clear rubrics assess behaviors like tool calling or user interaction.

Example Coding Evaluation

task:
  id: "fix-auth-bypass_1"
  desc: "Fix authentication bypass when password field is empty..."
  graders:
    - type: deterministic_tests
      required: [test_empty_pw_rejected.py, test_null_pw_rejected.py]
    - type: llm_rubric
      rubric: prompts/code_quality.md
    - type: static_analysis
      commands: [ruff, mypy, bandit]
    - type: state_check
      expect:
        security_logs: {event_type: "auth_blocked"}
    - type: tool_calls
      required:
        - {tool: read_file, params: {path: "src/auth/*"}}
        - {tool: edit_file}
        - {tool: run_tests}
  tracked_metrics:
    - type: transcript
      metrics:
        - n_turns
        - n_toolcalls
        - n_total_tokens
    - type: latency
      metrics:
        - time_to_first_token
        - output_tokens_per_sec
        - time_to_last_token

This illustration showcases the full range of graders. In practice, coding evals primarily rely on unit tests and LLM rubrics, adding graders and metrics only as needed.


Chapter 7: Evaluating Conversational Agents

Conversational agents interact in support, sales, or coaching contexts. Unlike traditional chatbots, they maintain state, use tools, and act mid-conversation. Interaction quality itself gets evaluated.

Design Considerations

Effective evals use verifiable end-state outcomes and rubrics that capture both task completion and interaction quality. Unlike most other evals, conversational evals often require a second LLM to simulate the user.
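
A user simulator is typically configured alongside the task itself. A minimal sketch, with illustrative field names and a placeholder model:

user_simulator:
  model: <any capable model>
  persona: "Customer whose order arrived damaged; frustrated but cooperative"
  goal: "Obtain a refund for order #10294 without escalating to a human"
  max_turns: 10
  stop_conditions: [refund confirmed, max_turns reached]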

Success is multidimensional: ticket resolution (state check), sub-10-turn completion (transcript constraint), and appropriate tone (LLM rubric) all matter simultaneously.

Benchmarks

Notable benchmarks include tau-Bench and its successor tau2-Bench, which simulate multi-turn interactions across domains like retail support and airline booking, with one model playing user personas while agents navigate realistic scenarios.

Example Conversational Evaluation

graders:
  - type: llm_rubric
    rubric: prompts/support_quality.md
    assertions:
      - "Agent showed empathy for customer's frustration"
      - "Resolution was clearly explained"
      - "Agent's response grounded in fetch_policy tool results"
  - type: state_check
    expect:
      tickets: {status: resolved}
      refunds: {status: processed}
  - type: tool_calls
    required:
      - {tool: verify_identity}
      - {tool: process_refund, params: {amount: "<=100"}}
      - {tool: send_confirmation}
  - type: transcript
    max_turns: 10
tracked_metrics:
  - type: transcript
    metrics:
      - n_turns
      - n_toolcalls
      - n_total_tokens
  - type: latency
    metrics:
      - time_to_first_token
      - output_tokens_per_sec
      - time_to_last_token

Conversational evals typically use model graders assessing communication quality and goal completion, handling multiple correct solutions gracefully.


Chapter 8: Evaluating Research Agents

Research agents gather, synthesize, and analyze information, producing answers or reports. Unlike coding with binary test signals, research quality is judged relative to task context. What counts as "comprehensive," "well-sourced," or "correct" depends on scope: market scans, acquisition due diligence, and scientific reports require different standards.

Unique Challenges

  • Experts may disagree on synthesis comprehensiveness
  • Ground truth shifts as reference content changes constantly
  • Longer open-ended outputs create more opportunity for error

BrowseComp tests whether agents can find needles-in-haystacks across the open web -- questions easy to verify but hard to solve.

Evaluation Strategy

One effective research eval strategy combines grader types:

  • Groundedness checks: Verify claims are supported by retrieved sources
  • Coverage checks: Define essential facts an answer must contain
  • Source quality checks: Confirm consulted sources are authoritative rather than merely first-retrieved

For objectively correct answers ("What was Company X's Q3 revenue?"), exact matching works. For open-ended synthesis, LLM graders flag unsupported claims and coverage gaps while checking coherence and completeness.

Given the subjectivity of research quality, LLM rubrics need frequent calibration against expert human judgment.
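
Example Research Evaluation

A hypothetical research eval combining these checks might look like the following, in the same style as the earlier examples (grader types and fields are illustrative):

graders:
  - type: llm_rubric
    rubric: prompts/research_quality.md
    assertions:
      - "Every claim is grounded in a retrieved source"
      - "No unsupported speculation is presented as fact"
      - "Synthesis directly answers the original question"
  - type: coverage_check
    required_facts: facts/company_x_q3.yaml   # essential facts the answer must contain
  - type: exact_match
    field: q3_revenue                          # objectively correct value, checked exactly
  - type: source_quality
    min_independent_sources: 3
tracked_metrics:
  - type: transcript
    metrics:
      - n_turns
      - n_toolcalls
      - n_total_tokens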


Chapter 9: Evaluating Computer Use Agents

Computer use agents interact with software identically to humans -- screenshots, mouse clicks, keyboard inputs, scrolling -- rather than APIs or code. They can use any GUI-based application, from design tools to legacy enterprise software.

Environment Requirements

Evaluation requires sandbox environments where agents use applications and achieve intended outcomes.

Benchmarks

  • WebArena: Tests browser-based tasks using URL and page-state checks verifying correct navigation, plus backend verification for data-modifying tasks (confirming actual order placement, not just confirmation page appearance).
  • OSWorld: Extends to full operating system control with evaluation scripts inspecting diverse post-task artifacts: file system state, application configs, database contents, and UI properties.

Efficiency Tradeoffs

Browser use agents balance token efficiency against latency: DOM-based interactions execute quickly but consume many tokens, while screenshot-based interactions execute slowly but are more token-efficient. During Claude for Chrome development, evals were built to check whether the agent selected the right tool, enabling faster and more accurate completion of browser tasks.
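
Example Computer Use Evaluation

Following the benchmarks above, a hypothetical computer use eval in the style of the earlier examples might combine UI-level and backend checks (fields illustrative):

task:
  id: "place-order_demo-shop_1"
  desc: "Order one unit of SKU-123 and ship it to the saved address..."
  graders:
    - type: state_check
      expect:
        orders_db: {sku: "SKU-123", quantity: 1, status: placed}   # backend truth, not just the confirmation page
    - type: url_check
      expect: "/orders/confirmation"
    - type: llm_rubric
      rubric: prompts/ui_interaction_quality.md
  tracked_metrics:
    - type: transcript
      metrics:
        - n_turns
        - n_screenshots
        - n_total_tokens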


Chapter 10: Non-Determinism in Agent Evaluations

Regardless of type, agent behavior varies between runs, complicating result interpretation. Each task has its own success rate -- perhaps 90% for one task, 50% for another -- and tasks passing one run might fail the next. Measuring how often agents succeed matters.

pass@k

The likelihood that an agent gets at least one correct solution in k attempts. Assuming independent trials with per-trial success probability p, pass@k = 1 - (1 - p)^k, so pass@k rises as k increases: more attempts mean a higher probability of at least one success. A 50% pass@1 means the model succeeds at half its tasks on the first try.

Coding agents often focus on first-try solutions (pass@1). Other cases accept multiple attempts as long as one works.

pass^k

The probability that all k trials succeed. With per-trial success probability p, pass^k = p^k, which falls as k increases, since demanding consistency across more trials is harder. With 75% per-trial success and three trials, pass^k is (0.75)^3, approximately 42%.

Customer-facing agents typically need this kind of consistency, because users expect reliable behavior on every interaction.

Choosing the Right Metric

Both metrics serve purposes; product requirements determine usage:

  • pass@k for single-success tools where one good result suffices
  • pass^k for consistent agent reliability where users need dependable outcomes

Chapter 11: From Zero to One -- A Roadmap for Great Evals

This chapter provides practical, field-tested advice for progressing from no evals to trustworthy ones. The guiding principle: define success early, measure clearly, iterate continuously.

Step 0: Start Early

Teams delay evals thinking hundreds of tasks are necessary. Reality: 20-50 simple tasks from real failures work excellently. Early agent changes produce obvious system impacts, sufficient for small sample sizes. Mature agents need larger, harder evals for detecting subtle effects, but early 80/20 approaches work best. Delayed evals become harder -- early product requirements naturally translate to test cases, while waiting means reverse-engineering success from live systems.

Step 1: Start With Existing Manual Tests

Begin with behaviors already verified during development -- pre-release checks and common user tasks. For production systems, review bug trackers and support queues. Converting user-reported failures into test cases ensures the suite reflects actual usage; prioritizing by impact maximizes the return on effort.

Step 2: Write Unambiguous Tasks With Reference Solutions

Writing good tasks is harder than it looks. A good task is one where two domain experts would independently reach the same pass/fail verdict. Another useful check: could you pass the task yourself, given only its instructions? If not, the task needs refinement. Ambiguity in the task specification becomes noise in the metric. The same principles apply to model-based grader criteria: vague rubrics produce inconsistent judgments.

Each task should be passable by an agent that correctly follows the instructions. This subtlety matters -- if a task leaves the filepath unspecified but the tests assume a particular one, the agent fails through no fault of its own. Grader checks should follow clearly from the task description; agents should not fail because the spec was ambiguous.

When frontier models show 0% pass@100, the cause is most often a broken task rather than an incapable agent -- a signal to double-check the specification and graders. Creating reference solutions -- known passing outputs -- proves that tasks are solvable and verifies that graders are configured correctly.
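
One lightweight way to operationalize this is to store a known-good output alongside each task and run the graders against it whenever the task or graders change. A sketch with illustrative fields, reusing the coding task from Chapter 6:

task:
  id: "fix-auth-bypass_1"
  reference_solution:
    patch: fixtures/fix-auth-bypass_1/reference.patch
    expected_grade: pass      # if the reference fails, the task or a grader is broken, not the agent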

Step 3: Build Balanced Problem Sets

Test both where behaviors should and should not occur. One-sided evals create one-sided optimization. Testing search-when-needed without search-when-unnecessary risks optimizing for searching everything. Avoid class-imbalanced evals.

A lesson learned from Claude.ai web search evals: preventing unwanted searches while preserving research ability proved challenging. Teams built evals both ways -- search-appropriate queries (weather) and knowledge-based answers (Apple's founder). Balancing undertriggering against overtriggering required many refinements. Continuously adding new example problems improves coverage.
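
A minimal sketch of such a balanced pair, with illustrative fields:

tasks:
  - id: "search-needed_local-weather"
    prompt: "What's the weather in Tokyo right now?"
    graders:
      - type: tool_calls
        required:
          - {tool: web_search}
  - id: "search-unneeded_apple-founders"
    prompt: "Who founded Apple?"
    graders:
      - type: tool_calls
        forbidden:
          - {tool: web_search}
      - type: llm_rubric
        assertions:
          - "Answer names Steve Jobs, Steve Wozniak, and Ronald Wayne"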

Step 4: Build a Robust Eval Harness With Stable Environment

Agents should function during evals much as they would in production, and the environment itself should not introduce additional noise.

Each trial should start "clean," isolated from unnecessary shared state between runs (leftover files, cached data, resource exhaustion) that causes correlated failures driven by infrastructure flakiness rather than agent performance. Shared state can also artificially inflate performance -- an agent that can examine git history from previous trials has an unfair advantage. And when multiple trials fail because of the same environment limitation, they are not independent, making the results unreliable as a measure of actual performance.
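
In practice this often means provisioning a fresh, isolated environment for every trial. A sketch of such a harness config, with illustrative fields:

harness:
  environment:
    image: agent-eval-base:latest   # rebuilt from a pinned definition, no leftover state
    fresh_instance_per_trial: true
    network: isolated               # no shared caches or cross-trial dependencies
  trials_per_task: 5
  max_concurrency: 20
  record: [messages, tool_calls, final_state]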

Step 5: Design Graders Thoughtfully

Great design involves selecting the best graders: deterministic grading where possible, LLM grading when necessary or when flexibility is needed, human grading judiciously for validation.

Avoid rigid step-checking. A common instinct is to check whether the agent followed a specific sequence of steps, but this approach is too inflexible and brittle: agents regularly find valid approaches no one anticipated. Rather than penalizing creativity, grade the artifacts produced, not the path taken.

Implement partial credit. Tasks with multiple components benefit from partial credit. A support agent that correctly identifies and verifies the problem but fails to process the refund performs meaningfully better than one that fails immediately; representing success as a continuum captures that difference.

Calibrate model graders carefully. LLM-as-judge graders need close calibration against human experts to minimize divergence. Give judges escape routes like "Unknown" when there is insufficient information, to avoid hallucinated verdicts. Structured rubrics that grade isolated dimensions with separate LLM judges produce better results than a single judge grading everything. Once the system is robust, occasional human review suffices.
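
A structured rubric of this kind might be expressed as separate, narrowly scoped judgments, each with an explicit escape option (sketch, fields illustrative):

llm_rubric:
  dimensions:
    - name: groundedness
      question: "Is every factual claim supported by a tool result or cited source?"
      options: [pass, fail, unknown]   # "unknown" avoids forced, hallucinated verdicts
    - name: tone
      question: "Is the response professional and empathetic?"
      options: [pass, fail, unknown]
  judge: one_judge_per_dimension       # rather than a single judge grading everything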

Watch for subtle failure modes. Some evaluations harbor subtle failure modes producing low scores despite good agent performance from grading bugs, harness constraints, or ambiguity. Even sophisticated teams miss these. Opus 4.5 initially scored 42% on CORE-Bench until researchers found multiple issues: rigid grading penalizing minor formatting differences, ambiguous specs, and stochastic irreproducible tasks. After fixes and less-constrained scaffolding, Opus 4.5 scored 95%.

Make graders cheat-resistant. Tasks and graders should demand actual problem-solving rather than loophole exploitation.

Step 6: Check Transcripts

Understanding how graders behave requires reading many trial transcripts along with their grades. Failed tasks reveal whether the agent genuinely made a mistake or a grader rejected a valid solution. Transcripts surface key details of both agent and eval behavior.

Failures should be fair: it should be clear what the agent got wrong and why. When scores plateau, you need confidence that the plateau reflects genuine performance limits rather than measurement problems. Reading transcripts verifies that evals measure what matters, and it is a critical agent-development skill.

Step 7: Monitor Capability Eval Saturation

Evals sitting at 100% still catch regressions but provide no signal for improvement. Eval saturation happens when agents pass all solvable tasks, leaving no room to measure further gains.

SWE-bench Verified scores began at around 30%; frontier models now score over 80%, approaching saturation. As saturation nears, progress appears to slow since only the hardest tasks remain, and this can be deceiving -- large capability improvements show up as small score increases.

Qodo initially doubted Opus 4.5's improvements because their one-shot coding evals missed gains on longer, more complex tasks. In response, they developed agentic evaluation frameworks that made the actual progress visible.

Generally, do not trust eval scores at face value without digging into details and reading transcripts. Unfair grading, ambiguous tasks, penalized valid solutions, or harness-constrained models all demand revisions.

Step 8: Keep Evaluation Suites Healthy

Eval suites are living artifacts needing ongoing attention and clear ownership.

Anthropic experimented with approaches and found that dedicated evals teams owning core infrastructure work most effectively, while domain experts and product teams contribute most tasks and run evaluations themselves.

AI product teams should own and iterate on evaluations as routinely as they maintain unit tests. Practicing eval-driven development means building evals that define a planned capability before the agent can fulfill it, then iterating until performance clears the bar. Each new model drop then immediately reveals whether a capability bet has paid off.

Product teams closest to requirements and users are best positioned to define success. Current model capabilities let users contribute eval tasks as pull requests. Better yet, actively enable them.


Chapter 12: Holistic Agent Understanding Beyond Evals

Automated evaluations run thousands of agent tasks without production deployment or real-user impact. However, a complete picture requires multiple methods working together.

Automated Evals (Programmatic Testing Without Real Users)

Pros: Faster iteration, fully reproducible, no user impact, can run on every commit, test scenarios at scale without production deployment.

Cons: Requires upfront investment, needs ongoing maintenance to avoid drift, can create false confidence if mismatched with real usage.

Production Monitoring (Live System Tracking)

Pros: Reveals real user-scale behavior, catches issues synthetic evals miss, provides ground truth.

Cons: Reactive -- problems reach users before the team becomes aware of them; noisy signals; requires instrumentation investment; lacks grading ground truth.

A/B Testing (Real User Traffic Variants)

Pros: Measures actual user outcomes (retention, task completion), controls confounds, scalable and systematic.

Cons: Slow (days/weeks to reach significance, requires traffic), tests only deployed changes, lacks explanation of underlying changes without thorough transcript review.

User Feedback (Explicit Signals)

Pros: Surfaces unanticipated problems, comes with real human examples, correlates with product goals.

Cons: Sparse and self-selected, skews toward severe issues, users rarely explain failure reasons, not automated; relying primarily on it risks negative user impact.

Manual Transcript Review

Pros: Builds failure-mode intuition, catches subtle quality issues automated checks miss, clarifies definitions of "good."

Cons: Time-intensive, does not scale, inconsistent coverage, reviewer fatigue affects signal quality, typically provides qualitative rather than clear quantitative grading.

Systematic Human Studies

Pros: Gold-standard quality judgments from multiple raters, handles subjective/ambiguous tasks, improves model-grader signals.

Cons: Relatively expensive with slow turnaround, runs infrequently, must reconcile inter-rater disagreement, complex domains require expert raters.

The Swiss Cheese Model

These methods map to different development stages. Combined methods work best -- like the Swiss Cheese Model from safety engineering: no single evaluation layer catches everything, but multiple methods mean failures slipping through one layer get caught by another.

The most effective teams combine automated evals for fast iteration, production monitoring for ground truth, and periodic human review for calibration.


Chapter 13: Conclusion

Teams without evals get trapped in reactive loops -- fixing failures that create others, unable to distinguish real regressions from noise. Teams investing early find the opposite: acceleration as failures become test cases, preventing regressions, and metrics replacing guesswork. Evals give teams clear objectives, turning "the agent feels worse" into actionable items. Value compounds but requires treating evals as core infrastructure, not afterthoughts.

Patterns vary by agent type, but fundamentals remain constant:

  1. Start early without waiting for perfection
  2. Source realistic failures from real usage
  3. Define unambiguous, robust success criteria
  4. Design thoughtful graders combining multiple types
  5. Ensure problem difficulty challenges models
  6. Iterate to improve signal-to-noise ratio
  7. Read the transcripts

AI agent evaluation remains nascent and fast-evolving. As agents tackle longer tasks, collaborate in multi-agent systems, and handle increasingly subjective work, techniques will continue to adapt.


Appendix: Eval Frameworks

Several open-source and commercial frameworks help implement agent evaluations without building infrastructure from scratch. Choice depends on agent type, existing stack, and offline/production observability needs.

  • Harbor: Suits containerized agent environments with cloud-provider-scale trial infrastructure and standardized task/grader formats. Popular benchmarks like Terminal-Bench 2.0 ship through Harbor registries.

  • Promptfoo: Lightweight, flexible, and open source, with declarative YAML configuration for prompt testing and assertion types ranging from string matching to LLM-as-judge rubrics (see the sketch after this list). Anthropic uses versions of Promptfoo for product evals.

  • Braintrust: Combines offline evaluation with production observability and experiment tracking. Useful for teams needing both development iteration and production quality monitoring. Its autoevals library includes pre-built factuality and relevance scorers.

  • LangSmith: Provides tracing, offline/online evaluations, and dataset management with tight LangChain ecosystem integration.

  • Langfuse: Offers similar capabilities as a self-hosted open-source alternative for teams with data-residency requirements.
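
As a point of reference for the declarative style mentioned above, a minimal config in Promptfoo's format looks roughly like this (abridged sketch; provider and model are placeholders):

prompts:
  - "Answer the customer's question: {{question}}"
providers:
  - anthropic:messages:<model-id>
tests:
  - vars:
      question: "How do I reset my password?"
    assert:
      - type: contains
        value: "reset"
      - type: llm-rubric
        value: "Response gives clear, step-by-step reset instructions"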

Many teams combine multiple tools, roll custom frameworks, or use simple evaluation scripts as starting points. While frameworks accelerate progress and standardize approaches, quality ultimately depends on eval tasks themselves. Picking frameworks that fit your workflow, then investing in high-quality test cases through iteration, typically works best.
