Companion to Zero to Eval ->
OpenNash reference atlas · source-first eval benchmarks

OpenNash's AI Evals Atlas

A practical map of AI benchmarks for knowledge work: LLMs, agents, RAG, coding, legal, medical, finance, CX, science, multimodal, voice, and safety. Use it to find the right public signal before spending weeks testing the wrong model, harness, or workflow.

114 benchmarks · Runnable links · Leaderboards · Datasets + repos

00

What Belongs In The Atlas

A benchmark is useful when you can trace it back to a task

Use the atlas to answer one question first: does this benchmark resemble the work your agent is supposed to do?

The atlas is organized by work shape, not only by model capability. A legal research benchmark, a voice support benchmark, and a browser-agent benchmark fail for different reasons, so they should not be mashed into one universal leaderboard without context.

The goal is triage. A team building a claims agent should not start with the same signal as a team building an IDE agent, a clinical assistant, or a diligence workflow. The right public benchmark narrows the model and harness choices before private eval design begins.

Layer 01

Scoreboards

Model comparison layers such as Artificial Analysis, LMArena, HELM, OpenCompass, and Hugging Face leaderboards.

Layer 02

Runnable Sets

Datasets and harnesses your team can run offline or adapt into CI, such as SWE-bench, LegalBench, FinanceBench, and HealthBench.

Layer 03

Agent Worlds

Interactive environments that test tool use, browsing, APIs, voice, desktop work, policy compliance, and long-horizon execution.

Layer 04

Domain Suites

Specialized collections for healthcare, legal, finance, science, CX, voice, security, and document-heavy work.

01

Best Benchmarks By Use Case

Fast answers for builders choosing eval signals

Use these as starting points. The right benchmark is the one whose task shape, scaffold, and failure modes resemble your product.

Coding agents

Use SWE-bench plus Terminal-Bench.

SWE-bench tests real repository fixes. Terminal-Bench tests shell-native task execution. Add LiveCodeBench, Aider Polyglot, and BigCodeBench for fresh coding and practical edit coverage.

See coding benchmarks ->

Customer support

Use tau-bench for policy and tool reliability.

Start with tau-bench, tau2/tau3, tau-voice, CRMArena, WorkArena, and BFCL. Support agents fail on policy, state, tools, handoffs, and final customer outcome.

Open CX stack ->

Legal agents

Use Harvey LAB plus legal RAG checks.

Harvey LAB is the stronger agentic legal signal. Pair it with LegalBench, LegalBench-RAG, LexGLUE, LawBench, and jurisdiction-specific citation checks.

Open legal stack ->

Healthcare

Separate conversation, workflow, and safety.

Use HealthBench for clinical conversations, MedHELM for holistic medical evaluation, MedAgentBench for EHR workflows, MedQA for knowledge, and MedSafetyBench for risk.

Open healthcare stack ->

RAG systems

Evaluate retrieval before generation.

Retrieval is a search problem: use Recall@k, Precision@k, and MRR. Then evaluate answer faithfulness, citation quality, and whether the response actually addresses the question.

Read RAG guidance ->

Production agents

Public scores only shortlist models.

Use benchmarks to choose candidates. Use private traces, binary checks, online monitoring, and handoff review to decide whether a production agent is safe to ship.

Read Zero to Eval ->

02

Live Model Value Board

Artificial Analysis quality, coding, speed, and price

The board uses Artificial Analysis model quality, coding, speed, latency, and price metrics. Rows with incomplete public metrics are filtered out.

Model quality, speed, and price (live table)
Columns: Model · Quality · Coding · Speed · TTFT · Blended price · Input / output · Quality per $1M

Source: Artificial Analysis. Rows require published quality, coding, speed, latency, and price metrics.
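
The page does not spell out how the "Blended price" and "Quality per $1M" columns are computed, so the following is only a sketch under stated assumptions: a 3:1 input-to-output token blend, and value defined as the quality score divided by the blended price per 1M tokens. The `ModelRow` fields and function names are hypothetical, not an Artificial Analysis API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ModelRow:
    name: str
    quality: Optional[float]
    coding: Optional[float]
    speed_tps: Optional[float]     # output tokens per second
    ttft_s: Optional[float]        # time to first token, in seconds
    input_price: Optional[float]   # USD per 1M input tokens
    output_price: Optional[float]  # USD per 1M output tokens


def is_complete(row: ModelRow) -> bool:
    """A row qualifies only if every public metric is present."""
    fields = (row.quality, row.coding, row.speed_tps,
              row.ttft_s, row.input_price, row.output_price)
    return all(value is not None for value in fields)


def blended_price(row: ModelRow, input_weight: float = 3.0,
                  output_weight: float = 1.0) -> float:
    """Weighted average price per 1M tokens (the 3:1 weights are an assumption)."""
    total = input_weight * row.input_price + output_weight * row.output_price
    return total / (input_weight + output_weight)


def value_board(rows: List[ModelRow]) -> List[Tuple[str, float]]:
    """Return (model, quality per blended dollar), best value first."""
    board = []
    for row in rows:
        if not is_complete(row):
            continue  # incomplete rows are filtered out
        price = blended_price(row)
        if price <= 0:
            continue  # skip free tiers to avoid dividing by zero
        board.append((row.name, round(row.quality / price, 2)))
    return sorted(board, key=lambda item: item[1], reverse=True)
```

A ratio like this is only a shortlist aid: a cheap model with a mediocre quality score can outrank a stronger one, so read it next to the raw quality and coding columns.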

03

Opinionated Benchmark Stacks

The highest-ROI benchmark menu by domain

Each tab groups the benchmark signals a builder should inspect first: scoreboards, domain datasets, runnable harnesses, and workflow environments.

04

Build An Eval Stack

From benchmark scouting to release checks

The stack builder translates a domain into the benchmark layers a team should inspect before designing private evals.

CX agent eval stack

A practical starting point for support agents that must follow policy, use tools, handle voice or chat, and resolve the customer need.

05

The Full Benchmark Atlas

Use the table after you have picked a domain stack

The full table is for source-hunting and deeper comparison. The stack above is the higher-ROI starting point.

Table columns: Benchmark · Domain · Type · What it tests · Best for · Sources

06

How Evals Work In Practice

From public benchmark to production gate

A good eval stack separates model selection from system reliability, then keeps testing after the agent is live.

01

Use public benchmarks to shortlist

Start broad: quality, coding, speed, cost, and domain-specific leaderboards narrow the model field before you run expensive simulations.

02

Match the unit of work

Pick the benchmark that resembles your task: final database state for CX, test-passing repo edits for coding, cited retrieval for legal and finance, clinical rubrics for healthcare.

03

Prefer runnable harnesses

A GitHub repo, dataset, or Hugging Face split can become a repeatable offline gate. A hosted leaderboard is useful for scouting, but it is harder to wire into a release gate.
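
As a rough illustration of what a "repeatable offline gate" can look like once a harness produces per-case pass/fail results, here is a minimal sketch. The `release_gate` function, the result format (case id mapped to a boolean), and the 2-point tolerance are all assumptions for illustration, not part of any specific benchmark harness.

```python
from typing import Dict, List, Mapping, Union


def pass_rate(results: Mapping[str, bool]) -> float:
    """Fraction of cases that passed."""
    if not results:
        raise ValueError("no eval results to score")
    return sum(results.values()) / len(results)


def release_gate(current: Mapping[str, bool],
                 baseline: Mapping[str, bool],
                 max_drop: float = 0.02) -> Dict[str, Union[float, bool, List[str]]]:
    """Compare a candidate run against the last accepted run.

    The gate passes when the pass rate has not dropped by more than
    `max_drop`. Cases that passed before and fail now are listed so a
    reviewer can inspect them even when the aggregate looks stable.
    """
    rate = pass_rate(current)
    newly_failing = sorted(
        case for case, ok in current.items()
        if not ok and baseline.get(case, False)
    )
    return {
        "pass_rate": rate,
        "passed_gate": rate >= pass_rate(baseline) - max_drop,
        "newly_failing": newly_failing,
    }
```

In CI, a wrapper script could exit non-zero when `passed_gate` is false. Listing newly failing cases matters because an unchanged aggregate can hide one fixed case and one new regression.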

04

Measure the whole agent

For deployed agents, the model is only one component. Retrieval, tools, policies, prompts, retries, memory, handoffs, latency, and user behavior all change the outcome.

05

Run offline and online evals

Offline evals catch regressions before release. Online evals monitor real traffic, escalation quality, human corrections, cost spikes, and silent failure modes.

06

Keep evidence attached

Every score should point back to a task, dataset, paper, or trace. Without source evidence, a composite leaderboard becomes a story instead of a control.

What is the minimum viable eval setup?

Start with trace review, not infrastructure. Read 20 to 50 real outputs after a meaningful change, write short notes on what broke, group those notes into failure categories, and turn the most frequent high-impact failures into binary checks. For a serious cycle, review roughly 100 fresh traces or continue until new traces stop revealing new failure types.
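
The grouping step needs nothing more than a spreadsheet, but a small script makes the ranking explicit. The sketch below assumes each review note has already been labeled with a failure category by a human; the category names, the impact weights, and both function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Note = Tuple[str, str]  # (trace_id, failure_category)


def rank_failures(notes: List[Note], impact: Dict[str, int],
                  top_n: int = 3) -> List[str]:
    """Rank categories by frequency times an impact weight (default weight 1)."""
    counts = Counter(category for _, category in notes)
    scores = {cat: n * impact.get(cat, 1) for cat, n in counts.items()}
    # Sort by descending score, then by name so ties are deterministic.
    return sorted(scores, key=lambda cat: (-scores[cat], cat))[:top_n]


def new_types_in_tail(notes: List[Note], window: int) -> Set[str]:
    """Categories that first appear in the last `window` notes.

    An empty result suggests recent traces have stopped revealing
    new failure types.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    seen_before = {cat for _, cat in notes[:-window]}
    seen_recently = {cat for _, cat in notes[-window:]}
    return seen_recently - seen_before
```

The top-ranked categories are the candidates to turn into binary checks first; the tail check is one way to decide when a review cycle has reached saturation.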

When should I use an LLM-as-judge?

Use deterministic checks first: schema validation, exact match, tool result assertions, citation presence, policy thresholds, execution tests, or regex. Use an LLM judge when the failure is important, subjective, and recurring enough to justify validation against human labels.
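
To make "deterministic checks first" concrete, here is a minimal sketch of three binary checks on a support agent's JSON output: schema validation, citation presence, and a policy threshold. The output format, the bracketed `[1]` citation style, and the refund limit are assumptions chosen for illustration only.

```python
import json
import re
from typing import Dict

CITATION_RE = re.compile(r"\[\d+\]")  # assumes citations look like [1], [2], ...

REQUIRED_FIELDS = {"answer": str, "refund_amount": (int, float)}


def check_schema(raw: str) -> bool:
    """True when the output is a JSON object with the required typed fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(
        isinstance(data.get(key), expected)
        for key, expected in REQUIRED_FIELDS.items()
    )


def run_checks(raw: str, refund_limit: float) -> Dict[str, bool]:
    """Run every binary check; dependent checks fail if the schema fails."""
    if not check_schema(raw):
        return {"schema": False, "citation": False, "policy": False}
    data = json.loads(raw)
    return {
        "schema": True,
        "citation": bool(CITATION_RE.search(data["answer"])),
        "policy": 0 <= data["refund_amount"] <= refund_limit,
    }
```

Checks like these are cheap, repeatable, and need no validation against human labels, which is why they come before any judge model.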

How often should production evals run?

Run offline gates before release. Re-run error analysis after model swaps, prompt changes, product changes, major bug fixes, incidents, complaint spikes, or metric drift. Between larger cycles, review a small weekly sample plus outliers such as long sessions, retries, escalations, and unusually expensive traces.
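
One way to assemble the weekly review set is to take every outlier plus a small random sample of ordinary traces. The sketch below assumes a simple `Trace` record; the field names and all thresholds are hypothetical and should be tuned to your own traffic.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trace:
    trace_id: str
    turns: int
    retries: int
    escalated: bool
    cost_usd: float


def is_outlier(trace: Trace, max_turns: int = 20, max_retries: int = 2,
               max_cost_usd: float = 1.0) -> bool:
    """Flag long sessions, heavy retries, escalations, and expensive traces."""
    return (trace.turns > max_turns
            or trace.retries > max_retries
            or trace.escalated
            or trace.cost_usd > max_cost_usd)


def weekly_review_set(traces: List[Trace], sample_size: int = 10,
                      seed: int = 0) -> List[Trace]:
    """All outliers, plus a seeded random sample of the remaining traces."""
    outliers = [t for t in traces if is_outlier(t)]
    ordinary = [t for t in traces if not is_outlier(t)]
    rng = random.Random(seed)  # seeded so the selection is reproducible
    sample = rng.sample(ordinary, min(sample_size, len(ordinary)))
    return outliers + sample
```

Sampling only from the non-outlier pool keeps the random portion representative of typical traffic rather than double-counting the unusual sessions.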

How do I evaluate RAG?

Split retrieval from generation. Retrieval uses search metrics like Recall@k, Precision@k, and MRR on query-document pairs. Generation uses task-specific checks for groundedness, citation quality, answer relevance, refusal behavior, and domain-specific mistakes such as wrong jurisdiction, wrong dosage, or stale filing data.
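
The retrieval metrics named above are short enough to compute by hand. This sketch uses the standard definitions and assumes each query has a ranked list of unique document ids plus a set of ids labeled relevant; the function names are illustrative.

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the top k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        raise ValueError("no queries to score")
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

Low recall means the generator never saw the evidence, so score retrieval first: generation checks on those queries mostly measure how the model behaves without the right context.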

How do I evaluate handoffs and agents?

Score the whole session first: did the user goal get resolved? Then inspect the first upstream failure. For human handoffs, check whether the handoff was necessary, timely, and context-rich, and whether it resolved the user need. For agent workflows, add step-level diagnostics only where end-to-end failures concentrate.
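
A minimal sketch of that order of operations: score the session, attribute each unresolved session to its first failing step, then count where failures concentrate. The `Session` and `Step` records and the step names are hypothetical; real traces would need a per-step pass/fail judgment from a reviewer or a check.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    name: str
    ok: bool


@dataclass
class Session:
    session_id: str
    resolved: bool     # end-to-end outcome: was the user goal met?
    steps: List[Step]  # steps in execution order


def first_upstream_failure(session: Session) -> Optional[str]:
    """Name of the earliest failing step, or None for a resolved session."""
    if session.resolved:
        return None
    for step in session.steps:
        if not step.ok:
            return step.name
    return "unattributed"  # unresolved, but no step was marked as failing


def failure_hotspots(sessions: List[Session]) -> Counter:
    """Count first upstream failures across all unresolved sessions."""
    causes = (first_upstream_failure(s) for s in sessions)
    return Counter(cause for cause in causes if cause is not None)
```

Attributing to the first failure avoids blaming downstream steps that only failed because they received bad input. A large "unattributed" count is a sign that step-level checks are missing.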

Shipping production agents?

OpenNash helps teams launch 24/7 agents with online and offline evals, human-in-the-loop review, durable execution, escalation paths, and production monitoring so agents do not fail quietly after the demo.