A practical map of AI benchmarks for knowledge work: LLMs, agents, RAG, coding, legal, medical, finance, CX, science, multimodal, voice, and safety. Use it to find the right public signal before spending weeks testing the wrong model, harness, or workflow.
Use the atlas to answer one question first: does this benchmark resemble the work your agent is supposed to do?
The atlas is organized by work shape, not only by model capability. A legal research benchmark, a voice support benchmark, and a browser-agent benchmark fail for different reasons, so they should not be mashed into one universal leaderboard without context.
The goal is triage. A team building a claims agent should not start with the same signal as a team building an IDE agent, a clinical assistant, or a diligence workflow. The right public benchmark narrows the model and harness choices before private eval design begins.
Scoreboards: model comparison layers such as Artificial Analysis, LMArena, HELM, OpenCompass, and Hugging Face leaderboards.
Runnable harnesses: datasets and harnesses your team can run offline or adapt into CI, such as SWE-bench, LegalBench, FinanceBench, and HealthBench.
Workflow environments: interactive environments that test tool use, browsing, APIs, voice, desktop work, policy compliance, and long-horizon execution.
Domain datasets: specialized collections for healthcare, legal, finance, science, CX, voice, security, and document-heavy work.
Use these as starting points. The right benchmark is the one whose task shape, scaffold, and failure modes resemble your product.
Use SWE-bench plus Terminal-Bench. SWE-bench tests real repository fixes. Terminal-Bench tests shell-native task execution. Add LiveCodeBench, Aider Polyglot, and BigCodeBench for fresh coding and practical edit coverage.
Start with tau-bench, tau2/tau3, tau-voice, CRMArena, WorkArena, and BFCL. Support agents fail on policy, state, tools, handoffs, and final customer outcome.
Harvey LAB is the stronger agentic legal signal. Pair it with LegalBench, LegalBench-RAG, LexGLUE, LawBench, and jurisdiction-specific citation checks.
Use HealthBench for clinical conversations, MedHELM for holistic medical evaluation, MedAgentBench for EHR workflows, MedQA for knowledge, and MedSafetyBench for risk.
Retrieval is a search problem: use Recall@k, Precision@k, and MRR. Then evaluate answer faithfulness, citation quality, and whether the response actually addresses the question.
Use benchmarks to choose candidates. Use private traces, binary checks, online monitoring, and handoff review to decide whether a production agent is safe to ship.
The board uses Artificial Analysis model quality, coding, speed, latency, and price metrics. Rows with incomplete public metrics are filtered out.
| Model | Quality | Coding | Speed | TTFT | Blended price | Input / output | Quality per $1M |
|---|---|---|---|---|---|---|---|
Source: Artificial Analysis. Rows require published quality, coding, speed, latency, and price metrics.
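As a rough illustration of how a board like this could filter rows and derive its price columns, here is a minimal sketch. The 3:1 input-to-output weighting for the blended price and the "quality divided by blended price" definition of quality per dollar are assumptions made for illustration, not formulas confirmed by the board. The field names are hypothetical as well.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical field names; every one must be present for a row to be shown.
REQUIRED_FIELDS = ("quality", "coding", "speed_tps", "ttft_s",
                   "input_price", "output_price")


@dataclass
class ModelRow:
    name: str
    quality: Optional[float] = None
    coding: Optional[float] = None
    speed_tps: Optional[float] = None     # output tokens per second
    ttft_s: Optional[float] = None        # time to first token, seconds
    input_price: Optional[float] = None   # USD per 1M input tokens
    output_price: Optional[float] = None  # USD per 1M output tokens


def is_complete(row: ModelRow) -> bool:
    """A row qualifies only if every required metric is published."""
    return all(getattr(row, field) is not None for field in REQUIRED_FIELDS)


def blended_price(row: ModelRow, input_weight: float = 3.0,
                  output_weight: float = 1.0) -> float:
    """Weighted average price per 1M tokens (the 3:1 ratio is an assumption)."""
    total = input_weight + output_weight
    return (input_weight * row.input_price
            + output_weight * row.output_price) / total


def quality_per_dollar(row: ModelRow) -> Optional[float]:
    """Assumed definition: quality score divided by blended price."""
    price = blended_price(row)
    return row.quality / price if price > 0 else None


def build_board(rows):
    """Drop incomplete rows, then sort by quality per dollar, best first."""
    complete = [r for r in rows if is_complete(r)]
    return sorted(complete, key=lambda r: quality_per_dollar(r) or 0.0,
                  reverse=True)
```

Keeping the completeness filter separate from the ranking makes it easy to see why a model is missing from the board: it lacked a published metric, not a good score.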
Each tab groups the benchmark signals a builder should inspect first: scoreboards, domain datasets, runnable harnesses, and workflow environments.
The stack builder translates a domain into the benchmark layers a team should inspect before designing private evals.
A practical starting point for support agents that must follow policy, use tools, handle voice or chat, and resolve the customer need.
The full table is for source-hunting and deeper comparison. The stack above is the higher-ROI starting point.
| Benchmark | Domain | Type | What it tests | Best for | Sources |
|---|---|---|---|---|---|
A good eval stack separates model selection from system reliability, then keeps testing after the agent is live.
Start broad: quality, coding, speed, cost, and domain-specific leaderboards narrow the model field before you run expensive simulations.
Pick the benchmark that resembles your task: final database state for CX, test-passing repo edits for coding, cited retrieval for legal and finance, clinical rubrics for healthcare.
A GitHub repo, dataset, or Hugging Face split can become a repeatable offline gate. A hosted leaderboard is useful for scouting, but harder to turn into a release gate.
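To make the idea of a repeatable offline gate concrete, here is a minimal sketch. It assumes each case from the dataset carries its own binary check and that the system under test is a plain function from prompt to output. `Case`, `GateResult`, `run_offline_gate`, and the 0.9 default threshold are hypothetical names and values chosen for illustration.

```python
from typing import Callable, Iterable, List, NamedTuple


class Case(NamedTuple):
    case_id: str
    prompt: str
    check: Callable[[str], bool]  # binary pass/fail check on the output


class GateResult(NamedTuple):
    pass_rate: float
    failed_ids: List[str]
    passed_gate: bool


def run_offline_gate(cases: Iterable[Case],
                     system: Callable[[str], str],
                     min_pass_rate: float = 0.9) -> GateResult:
    """Run every case through the system and compare pass rate to a threshold."""
    cases = list(cases)
    if not cases:
        raise ValueError("offline gate needs at least one case")
    # Keep the failing ids so a blocked release points at concrete cases.
    failed = [c.case_id for c in cases if not c.check(system(c.prompt))]
    pass_rate = (len(cases) - len(failed)) / len(cases)
    return GateResult(pass_rate, failed, pass_rate >= min_pass_rate)
```

In CI, a script could call `run_offline_gate` and exit non-zero when `passed_gate` is false, printing `failed_ids` for triage.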
For deployed agents, the model is only one component. Retrieval, tools, policies, prompts, retries, memory, handoffs, latency, and user behavior all change the outcome.
Offline evals catch regressions before release. Online evals monitor real traffic, escalation quality, human corrections, cost spikes, and silent failure modes.
Every score should point back to a task, dataset, paper, or trace. Without source evidence, a composite leaderboard becomes a story instead of a control.
Start with trace review, not infrastructure. Read 20 to 50 real outputs after a meaningful change, write short notes on what broke, group those notes into failure categories, and turn the most frequent high-impact failures into binary checks. For a serious cycle, review roughly 100 fresh traces or continue until new traces stop revealing new failure types.
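The grouping and stopping steps above can be sketched in a few lines. This assumes review notes are recorded as `(trace_id, category, is_high_impact)` tuples; that shape, the function names, and the window size are hypothetical choices, not a prescribed format.

```python
from collections import Counter


def rank_failure_categories(notes, top_n=3):
    """Rank categories by high-impact count first, then by total frequency.

    notes: iterable of (trace_id, category, is_high_impact) tuples.
    Returns a list of (category, total_count, high_impact_count).
    """
    counts = Counter()
    high_impact = Counter()
    for _trace_id, category, is_high_impact in notes:
        counts[category] += 1
        if is_high_impact:
            high_impact[category] += 1
    ranked = sorted(counts, key=lambda c: (high_impact[c], counts[c]),
                    reverse=True)
    return [(c, counts[c], high_impact[c]) for c in ranked[:top_n]]


def new_categories_in_window(notes, window=20):
    """Categories that first appear in the most recent `window` notes.

    An empty result suggests that fresh traces have stopped revealing
    new failure types, which is one possible stopping signal.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    notes = list(notes)
    earlier = {category for _, category, _ in notes[:-window]}
    recent = {category for _, category, _ in notes[-window:]}
    return recent - earlier
```

The categories at the top of the ranking are the candidates to turn into binary checks first.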
Use deterministic checks first: schema validation, exact match, tool result assertions, citation presence, policy thresholds, execution tests, or regex. Use an LLM judge when the failure is important, subjective, and recurring enough to justify validation against human labels.
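A few of these deterministic checks might look like the sketch below. Each function returns a plain boolean so it can serve as a binary check. The `[1]`-style citation pattern, the refund threshold, and the tool-call record shape are assumptions for illustration; adapt them to your own output format and policy.

```python
import json
import re

# Assumed citation style: bracketed numbers such as [1] or [12].
CITATION_PATTERN = re.compile(r"\[\d+\]")


def check_schema(raw_output, required_keys):
    """Schema validation: output parses as a JSON object with the required keys."""
    try:
        data = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(data, dict) and all(k in data for k in required_keys)


def check_exact_match(output, expected):
    """Exact match after trimming whitespace and ignoring case."""
    return output.strip().lower() == expected.strip().lower()


def check_citation_present(answer):
    """Regex check: at least one citation marker appears in the answer."""
    return bool(CITATION_PATTERN.search(answer))


def check_policy_threshold(refund_amount, max_refund=100.0):
    """Policy threshold: a hypothetical refund cap the agent must not exceed."""
    return refund_amount <= max_refund


def check_tool_result(tool_calls, tool_name, expected_status="ok"):
    """Tool result assertion: the named tool was called and returned the status."""
    return any(call.get("name") == tool_name
               and call.get("status") == expected_status
               for call in tool_calls)
```

Checks like these are cheap to run on every trace, which leaves the LLM judge for the failures that none of them can express.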
Run offline gates before release. Re-run error analysis after model swaps, prompt changes, product changes, major bug fixes, incidents, complaint spikes, or metric drift. Between larger cycles, review a small weekly sample plus outliers such as long sessions, retries, escalations, and unusually expensive traces.
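One way to assemble the weekly review set is to pull every outlier and then add a small random sample of the remaining traces. The sketch below assumes each trace is a dict with `id`, `turns`, `retries`, `escalated`, and `cost_usd` keys. Those keys and all thresholds are hypothetical and should be tuned to your own traffic.

```python
import random


def select_traces_for_review(traces, sample_size=20, seed=0,
                             max_turns=30, max_retries=3, max_cost_usd=1.0):
    """Return (outliers, random_sample) for a weekly review pass."""
    # Outliers: long sessions, heavy retries, escalations, expensive traces.
    outliers = [
        t for t in traces
        if t["turns"] > max_turns
        or t["retries"] > max_retries
        or t["escalated"]
        or t["cost_usd"] > max_cost_usd
    ]
    outlier_ids = {t["id"] for t in outliers}
    rest = [t for t in traces if t["id"] not in outlier_ids]
    # A fixed seed keeps the sample reproducible for a given week's export.
    rng = random.Random(seed)
    sample = rng.sample(rest, min(sample_size, len(rest)))
    return outliers, sample
```

Reviewing outliers separately from the random sample keeps rare, expensive failures from being drowned out by ordinary traffic.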
Split retrieval from generation. Retrieval uses search metrics like Recall@k, Precision@k, and MRR on query-document pairs. Generation uses task-specific checks for groundedness, citation quality, answer relevance, refusal behavior, and domain-specific mistakes such as wrong jurisdiction, wrong dosage, or stale filing data.
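The retrieval-side metrics can be computed without any model in the loop. The sketch below uses the standard definitions and assumes each query comes with a ranked list of retrieved document ids and a set of relevant ids; the function names and the id format are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k slots filled by relevant documents.

    Divides by k even when fewer than k documents were retrieved.
    """
    if k <= 0:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)
```

Because these functions only need query-document pairs, they can run in an offline gate independently of the generation checks.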
Score the whole session first: did the user goal get resolved? Then inspect the first upstream failure. For human handoffs, check whether the handoff was necessary, timely, and context-rich, and whether it resolved the user need. For agent workflows, add step-level diagnostics only where end-to-end failures concentrate.
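A minimal sketch of this session-first scoring follows. It assumes each session records whether the user goal was resolved plus an ordered list of steps with a pass/fail flag. `Step`, `SessionScore`, and the step names in the usage note are hypothetical.

```python
from collections import Counter
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    name: str
    ok: bool


class SessionScore(NamedTuple):
    goal_resolved: bool
    first_failure: Optional[str]


def score_session(goal_resolved: bool, steps: List[Step]) -> SessionScore:
    """Score end to end first; look for a failing step only if the goal failed."""
    if goal_resolved:
        return SessionScore(True, None)
    # The first failing step in execution order is the upstream failure.
    first_failure = next((s.name for s in steps if not s.ok), None)
    return SessionScore(False, first_failure)


def failure_hotspots(scores):
    """Count first failures across sessions to see where they concentrate."""
    counts = Counter(
        s.first_failure for s in scores
        if not s.goal_resolved and s.first_failure is not None
    )
    return counts.most_common()
```

For example, if most failed sessions report a hypothetical `tool_call` step as their first failure, that step is where step-level diagnostics are worth adding.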
OpenNash helps teams launch 24/7 agents with online and offline evals, human-in-the-loop review, durable execution, escalation paths, and production monitoring so agents do not fail quietly after the demo.