OpenNash research · source-backed reference

AI Evaluation Benchmark Atlas

Name: OpenNash AI Evaluation Benchmark Atlas dataset
Creator: OpenNash Research
License: https://opennash.com/terms/

Compare 159 public benchmarks and explore 757 reported model results across agents, coding, retrieval, regulated work, science, multimodal systems, and safety. Every record keeps its source path and verification date.

159 benchmarks · 11 domains · 346 model IDs · JSON + CSV · Updated 2026-07-16

Start With The Decision

Model, system, or production behavior?

A benchmark only helps when its unit of work resembles the decision you are trying to make.

Choose a model

Scout broad capability

Use transparent scoreboards and fresh capability sets to narrow the field by quality, speed, price, coding, and domain signal.

See model signals →

Choose a harness

Match the real task

Prefer runnable repositories, datasets, deterministic end states, and environments that resemble the tools and permissions you deploy.

Choose a domain →

Ship reliably

Build private evals

Convert production traces and incidents into release gates. Public scores cannot measure your prompts, data, policies, retries, or handoffs.

Use the operating method →

Current Model Signals

Live comparison, reference snapshot as fallback

These are scouting signals from Artificial Analysis. Check model snapshot, reasoning settings, price assumptions, and benchmark methodology before deciding.

Model quality, speed, and price · reference snapshot · 2026-07-15

Model	Quality	Coding	Speed	TTFT	Blended price	Input / output	Quality per $1M
1. Claude Opus 4.8 (max)	61	n/a	n/a	n/a	n/a	n/a	n/a
2. GPT-5.5 (xhigh)	60	n/a	n/a	n/a	n/a	n/a	n/a
3. Gemini 3.1 Pro Preview	57	n/a	n/a	n/a	n/a	n/a	n/a
4. MiniMax-M3	55	n/a	n/a	n/a	n/a	n/a	n/a
5. Kimi K2.6	54	n/a	n/a	n/a	n/a	n/a	n/a
6. MiMo-V2.5-Pro	54	n/a	n/a	n/a	n/a	n/a	n/a
7. Grok 4.3 (high)	53	n/a	n/a	n/a	n/a	n/a	n/a
8. Muse Spark	52	n/a	n/a	n/a	n/a	n/a	n/a
9. DeepSeek V4 Pro (Max)	52	n/a	n/a	n/a	n/a	n/a	n/a
10. Nemotron 3 Ultra	48	n/a	n/a	n/a	n/a	n/a	n/a
11. gpt-oss-120b (high)	33	n/a	n/a	n/a	n/a	n/a	n/a

Source: Artificial Analysis API. A dated reference snapshot remains visible if live data is unavailable.
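The "Blended price" and "Quality per $1M" columns can be read as simple derived values. The sketch below is one plausible way to compute them, not the source's confirmed formula: the 3:1 input-to-output weighting, the function names, and the example prices are all assumptions made for illustration.

```python
def blended_price(input_price, output_price, input_weight=3, output_weight=1):
    """Weighted average USD price per 1M tokens.

    The 3:1 input-to-output weighting is an assumption, not a documented rule.
    """
    total = input_weight + output_weight
    return (input_weight * input_price + output_weight * output_price) / total


def quality_per_dollar(quality, input_price, output_price):
    """Quality index points per blended USD per 1M tokens.

    Returns None when a price is missing or zero, mirroring the n/a cells above.
    """
    if input_price is None or output_price is None:
        return None
    price = blended_price(input_price, output_price)
    return quality / price if price > 0 else None


# Hypothetical prices: $2 per 1M input tokens, $10 per 1M output tokens.
example = quality_per_dollar(60, 2.0, 10.0)
```

Because the ratio depends heavily on the weighting, recompute it with the input and output mix of your own traffic before comparing models.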

Choose A Model For The Work

Evidence-aware benchmark aggregation · July 2026

The score is relative performance inside the selected benchmark set. Coverage and confidence stay separate so a one-benchmark winner cannot masquerade as a universal best model.

Catalog: 159 source-linked benchmarks

Scored layer: 34 official registry datasets

Submissions: 757, all configurations retained

Model IDs: 346 exact IDs, never merged

Read this first: this is a reported-system comparison, not a base-model IQ table. Tools, prompts, reasoning modes, scaffolds, budgets, and frame counts can differ. The default view requires evidence from at least two benchmarks.
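To make the separation of score, coverage, and the two-benchmark minimum concrete, here is a minimal sketch of an evidence-filtered ranking. It is not the Atlas formula: the normalization (each score divided by the best score on that benchmark, then rescaled so the top model is 100), the choice to keep the best configuration per model and benchmark, and all model and benchmark names are assumptions.

```python
from collections import defaultdict

# Hypothetical records: (model_id, benchmark, score on that benchmark's own scale).
RESULTS = [
    ("model-a", "bench-1", 80.0),
    ("model-a", "bench-2", 60.0),
    ("model-b", "bench-1", 72.0),
    ("model-b", "bench-2", 66.0),
    ("model-b", "bench-3", 50.0),
    ("model-c", "bench-3", 95.0),  # a one-benchmark winner
]


def rank(results, selected, min_benchmarks=2):
    """Rank models by mean relative score; report coverage as a separate field."""
    # Best reported score per selected benchmark, used as the normalizer.
    best = defaultdict(float)
    for _, bench, score in results:
        if bench in selected:
            best[bench] = max(best[bench], score)

    # Relative score per (model, benchmark), keeping the best configuration.
    per_model = defaultdict(dict)
    for model, bench, score in results:
        if bench in selected and best[bench] > 0:
            rel = score / best[bench]
            per_model[model][bench] = max(per_model[model].get(bench, 0.0), rel)

    rows = []
    for model, rels in per_model.items():
        if len(rels) < min_benchmarks:
            continue  # not enough evidence for the default view
        mean_rel = sum(rels.values()) / len(rels)
        coverage = len(rels) / len(selected)
        rows.append((model, mean_rel, coverage, len(rels)))
    if not rows:
        return []

    top = max(r[1] for r in rows)
    ranked = [
        {"model": m, "score": round(100 * rel / top, 1),
         "coverage": round(cov, 2), "benchmarks": n}
        for m, rel, cov, n in rows
    ]
    return sorted(ranked, key=lambda r: r["score"], reverse=True)


ranking = rank(RESULTS, selected={"bench-1", "bench-2", "bench-3"})
```

In this toy data, model-c tops its only benchmark but is excluded from the default view, while model-a ranks first with two of three benchmarks covered. Coverage is reported next to the score so it is never folded into it.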

Current decision

Reasoning & knowledge

Hard questions, math, factuality, and structured instruction following.


Relative performance within selected evidence · higher is better

Rank	Score	Coverage	Evidence
1	100	27%	directional · 2 benchmarks
2	97.6	63%	moderate · 5 benchmarks
3	95.5	27%	directional · 2 benchmarks
4	86.4	24%	directional · 2 benchmarks
5	85.4	77%	moderate · 6 benchmarks
6	84.5	66%	moderate · 5 benchmarks
7	84.3	25%	directional · 2 benchmarks
8	84.2	53%	moderate · 4 benchmarks
9	81.8	41%	directional · 3 benchmarks
10	81.6	41%	directional · 3 benchmarks

Score vs evidence coverage · upper-right is stronger and better covered

Download all score records (JSON) · Download default rankings (CSV) · Audit the formula · Official registry source

The table above is the evidence-filtered Reasoning & knowledge view. The interactive page also lets you change use cases, customize the benchmark mix, draw the evidence plot, and inspect per-model provenance.

Browse By Evaluation Domain

Stable categories, secondary tags in the data

Each hub explains what to measure, how to interpret public results, and which source-backed benchmarks belong in the shortlist.

General Models

Broad model quality, reasoning, factuality, instruction following, and meta-evaluation.

13 benchmarks · Open domain guide →

Agents & Tool Use

Tool calling, browsers, computer use, office work, and long-horizon agent behavior.

30 benchmarks · Open domain guide →

Coding & SWE

Repository repair, shell tasks, code generation, debugging, and long-horizon software work.

17 benchmarks · Open domain guide →

Retrieval & RAG

Retrieval, embeddings, long context, document AI, grounded generation, and citations.

13 benchmarks · Open domain guide →

Customer Support

Support resolution, CRM workflows, handoffs, voice CX, policy, and grounded answers.

5 benchmarks · Open domain guide →

Legal

Legal reasoning, research, document review, retrieval, and jurisdiction-sensitive tasks.

11 benchmarks · Open domain guide →

Finance

Filings, financial reasoning, professional work, extraction, and evidence-backed analysis.

9 benchmarks · Open domain guide →

Healthcare

Clinical knowledge, biomedical reasoning, safety, medical agents, and decision support.

15 benchmarks · Open domain guide →

Science & Reasoning

Math, physics, biology, research, scientific coding, and complex verifiable reasoning.

19 benchmarks · Open domain guide →

Multimodal & Voice

Images, video, audio, speech, multimodal browsing, and voice-agent behavior.

16 benchmarks · Open domain guide →

Safety & Security

Harm, misuse, cyber capability, policy compliance, and dangerous capability evaluation.

11 benchmarks · Open domain guide →

Opinionated Benchmark Stacks

A higher-ROI menu by domain

Each stack combines a scouting signal, a domain set, a runnable harness, and a workflow-level environment.

CX stack

Customer support and voice agents

Highest ROI for support builders: use live model quality/value first, then benchmark policy-following, tool use, customer state changes, CRM workflows, and knowledge retrieval.

Live model board

Artificial Analysis

Quality, speed, coding, and price shortlist before expensive support simulations.

Source →

Support benchmark

tau-bench

Retail and airline customer-service agents with policies, tools, users, and state checks.

Source →

Support benchmark

tau2 / tau3-bench

Expanded tau domains, task fixes, knowledge, and voice-ready customer-service evaluation.

Source →

Knowledge support

tau-knowledge

Support agents that must retrieve and use unstructured policy knowledge.

Source →

CRM workflow

CRMArena

CRM-style service-agent, analyst, and operations workflows.

Source →

Enterprise workflow

WorkArena / WorkArena++

ServiceNow-style enterprise workflows for business-process agents.

Source →

Tool reliability

BFCL / ToolSandbox

Function calling, multi-turn tool use, and stateful tool execution.

Source →

Build An Eval Stack

From benchmark scouting to release checks

Use this recipe to translate a public benchmark category into the layers a production system needs.

CX agent eval stack

For support agents that must follow policy, use tools, resolve customer state, and hand off cleanly.


Model shortlist

Artificial Analysis

Shortlist models by quality, coding/tool aptitude, speed, latency, and cost.

Domain benchmark

tau-bench / tau2 / tau-voice

Test policy following, tool use, user simulation, voice readiness, and final state changes.

Workflow layer

CRMArena / WorkArena

Exercise CRM and enterprise service workflows where process state matters.

Private gate

Top support traces

Turn refunds, cancellations, escalations, missing data, and policy conflicts into pass/fail tests.

Online monitor

Handoffs + outcomes

Track resolution, handoff timing, context transfer, retries, cost, latency, and complaints.
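The private-gate layer of this stack can be sketched as a small replay harness. This is an illustrative outline under stated assumptions, not a prescribed implementation: the `TraceCase` schema, the `run_gate` function, the `demo_agent` stand-in, and the sample cases are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TraceCase:
    """One replayable case distilled from a production trace (hypothetical schema)."""
    case_id: str
    category: str          # e.g. "refund", "cancellation", "escalation"
    expected_state: dict   # final state the system must reach
    must_hand_off: bool    # whether a human handoff is required


def run_gate(cases, agent, min_pass_rate=1.0):
    """Replay each case through `agent`; return (gate_passed, failed_case_ids).

    `agent` is any callable that takes a TraceCase and returns
    (final_state, handed_off).
    """
    failures = []
    for case in cases:
        final_state, handed_off = agent(case)
        ok = final_state == case.expected_state and handed_off == case.must_hand_off
        if not ok:
            failures.append(case.case_id)
    pass_rate = 1 - len(failures) / len(cases) if cases else 0.0
    return pass_rate >= min_pass_rate, failures


def demo_agent(case):
    # Stand-in for the real system under test: reaches the right state
    # but never hands off to a human.
    return dict(case.expected_state), False


CASES = [
    TraceCase("refund-001", "refund", {"order": "refunded"}, False),
    TraceCase("escalate-007", "escalation", {"ticket": "open"}, True),
]

passed, failed = run_gate(CASES, demo_agent)
```

Here the gate fails because the escalation case requires a handoff that the stand-in agent never performs. Checking final state and handoff separately keeps a correct outcome from hiding a broken escalation path.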

The Full Benchmark Atlas

Filter the normalized, source-backed collection

A “recommended” label means useful for shortlisting or harness discovery. It is editorial guidance, not a quality score or endorsement.

Catalog (JSON) · Catalog (CSV) · Scores + provenance (JSON) · Rankings (CSV) · Read data definitions
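As a rough illustration of filtering the catalog export, the sketch below shortlists benchmarks by domain, editorial label, and access level. The column names (`name`, `domain`, `label`, `access`, `verified`) are assumptions made for this example; the published data definitions are the authority on the real schema. The four sample rows reuse values from the table in this section.

```python
import csv
import io

# Hypothetical column names; values taken from the catalog table in this section.
SAMPLE_CSV = """name,domain,label,access,verified
tau-bench,Customer Support,recommended,runnable,2026-07-15
CRMArena,Customer Support,watch,partial,2026-07-15
SWE-bench Verified,Coding & SWE,recommended,runnable,2026-07-15
CursorBench,Coding & SWE,commercial,hosted,2026-07-15
"""


def shortlist(csv_text, domain, labels=("recommended",), access=("runnable",)):
    """Return benchmark names in `domain` that match the allowed labels and access levels."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        row["name"]
        for row in rows
        if row["domain"] == domain
        and row["label"] in labels
        and row["access"] in access
    ]


support_picks = shortlist(SAMPLE_CSV, "Customer Support")
```

Since the label is editorial guidance rather than a quality score, widening `labels` or `access` is a reasonable step when the default shortlist comes back empty.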

Benchmark	Domain	Type	What it tests	Best for	Sources
Artificial Analysis · recommended · partial · Verified 2026-07-15	General Models Meta	scoreboard current · mixed	Model intelligence, coding, math, science, speed, latency, and pricing across commercial and open models.	Model shortlist and cost-performance tradeoffs	source leaderboard
Benchmark Health Index · recommended · runnable · Verified 2026-07-15	General Models Benchmark quality auditing	toolkit emerging · open	Benchmarks themselves across discrimination, anti-saturation, and ecosystem impact.	Choosing which public benchmarks still carry useful signal.	repo paper
Benchmark² · specialized · partial · Verified 2026-07-15	General Models Benchmark validity meta-evaluation	toolkit emerging · open	Benchmark quality through cross-ranking consistency, discriminability, and capability alignment.	Auditing whether an evaluation measures the capability its label implies.	paper
EleutherAI LM Evaluation Harness · recommended · runnable · Verified 2026-07-15	General Models Meta	harness current · open	Standardized runner for many classic language-model tasks and benchmark suites.	CI-style baseline model evaluation	repo
HELM · recommended · runnable · Verified 2026-07-15	General Models Meta	harness current · open	Holistic evaluation framework with transparent scenarios, metrics, prompts, and model predictions.	Academic reproducibility and broad model audits	source repo
Hugging Face Open LLM Leaderboard · reference · partial · Verified 2026-07-15	General Models Meta	scoreboard legacy · mixed	Open-weight model leaderboard and result archive; useful historically, but check freshness before treating it as current.	Open model comparison and historical baselines	leaderboard source
IFEval · recommended · runnable · Verified 2026-07-15	General Models Instruction following	benchmark current · open	Instruction-following benchmark with verifiable formatting and constraint adherence.	Production assistants with strict output rules	dataset
IFStruct v1.0 · specialized · runnable · Verified 2026-07-15	General Models Structured instruction following	benchmark emerging · open	Whether models satisfy compositional structural constraints in JSON and YAML outputs.	Comparing schema compliance before testing private structured-output contracts.	dataset results
LMArena / Chatbot Arena · recommended · hosted · Verified 2026-07-15	General Models Meta	scoreboard current · hosted	Blind pairwise human preference for chat, coding, and style-controlled model comparisons.	Human preference and conversational quality	source dataset
OpenCompass / CompassRank · recommended · runnable · Verified 2026-07-15	General Models Meta	harness current · open	Open evaluation platform across many datasets with public and private benchmark dimensions.	Broad open-source evaluation runs	repo leaderboard
SimpleQA · recommended · runnable · Verified 2026-07-15	General Models Factuality	benchmark current · open	Short factual questions with clear answers for hallucination and factual recall checks.	Factuality and hallucination screening	repo source
Toloka Arena · commercial · hosted · Verified 2026-07-15	General Models Meta	scoreboard current · commercial	Hosted agentic-intelligence leaderboard with composite pass rates and enterprise-domain datasets.	Commercial agent benchmark comparison	leaderboard
TruthfulQA · reference · runnable · Verified 2026-07-15	General Models Factuality	benchmark legacy · open	Questions designed to test whether models repeat common falsehoods.	Truthfulness baseline and regression testing	repo
ACE · watch · partial · Verified 2026-07-15	Agents & Tool Use Everyday agent capability	benchmark emerging · open	Agent capability across consumer task areas such as shopping, food, gaming, and DIY.	Tracking an emerging cross-domain agent benchmark; inspect task coverage before use.	dataset results
AgentBench · reference · runnable · Verified 2026-07-15	Agents & Tool Use Agents	agent benchmark legacy · open	LLM agent evaluation across multiple interactive environments.	General agent benchmarking	repo
APEX v1 Extended · specialized · runnable · Verified 2026-07-15	Agents & Tool Use Economically valuable professional tasks	benchmark emerging · open	Agent performance on extended economically valuable tasks spanning multiple jobs.	Testing whether professional-work rankings survive broader task coverage.	dataset results
APEX-Agents · specialized · runnable · Verified 2026-07-15	Agents & Tool Use Long-horizon professional work	benchmark emerging · open	Cross-application completion of long-horizon professional-services tasks.	Shortlisting agent systems for realistic knowledge-work workflows.	dataset results
AstaBench · recommended · runnable · Verified 2026-07-15	Agents & Tool Use Science / Agents	agent benchmark current · open	Scientific research-agent suite covering literature understanding, code execution, data analysis, and end-to-end discovery workflows.	Scientific research agents	source repo
Berkeley Function Calling Leaderboard / BFCL · recommended · runnable · Verified 2026-07-15	Agents & Tool Use Tool use	benchmark current · open	Executable function-calling tests including multi-turn and tool-use scenarios.	Tool calling and API accuracy	leaderboard repo
BrowseComp · recommended · runnable · Verified 2026-07-15	Agents & Tool Use Browser agents	benchmark current · open	Browsing-agent benchmark for hard-to-find web answers that require search and synthesis.	Web browsing agents	source repo
CCTU · specialized · runnable · Verified 2026-07-15	Agents & Tool Use Tool use under complex constraints	benchmark emerging · open	Tool use across 200 cases containing resource, behavior, toolset, and response constraints.	Finding constraint violations that task-success-only agent evals miss.	repo paper
Claw Bench · specialized · runnable · Verified 2026-07-15	Agents & Tool Use Cross-domain agent products	benchmark emerging · open	Agent products on 314 reproducible tasks across 33 domains and four difficulty levels.	Regression testing general-purpose agents with inspectable end-state verifiers.	repo leaderboard
Claw-Eval · specialized · runnable · Verified 2026-07-15	Agents & Tool Use Real-world agent evaluation	benchmark emerging · open	Practical task completion by tool-using agent systems.	A current, source-backed signal for open agent frameworks.	dataset results
GAIA · recommended · partial · Verified 2026-07-15	Agents & Tool Use Agents	agent benchmark current · mixed	Real-world assistant tasks requiring reasoning, browsing, multimodality, and tool use.	General agent task solving	dataset leaderboard
MCP-Atlas · recommended · runnable · Verified 2026-07-15	Agents & Tool Use Agents / Tool use	agent benchmark current · open	Large-scale tool-use benchmark over real MCP servers and multi-call tasks.	Production-style MCP tool agents	repo leaderboard paper
Mind2Web · recommended · runnable · Verified 2026-07-15	Agents & Tool Use Browser agents	benchmark current · open	Web navigation tasks across many real websites and domains.	Web-agent planning and UI grounding	repo
OfficeBench · watch · partial · Verified 2026-07-15	Agents & Tool Use Office agents	agent benchmark emerging · mixed	Office-document, email, calendar, and productivity-task automation.	Knowledge-worker office workflows	repo
OSWorld · recommended · runnable · Verified 2026-07-15	Agents & Tool Use Computer use	agent benchmark current · open	Desktop/web/app tasks with execution scripts in realistic computer environments.	Computer-use agents	source repo
PaperBench · recommended · runnable · Verified 2026-07-15	Agents & Tool Use Science / Agents	agent benchmark current · open	Research-replication benchmark for agents attempting to reproduce AI papers and artifacts.	Research replication agents	source repo
PM-Bench · specialized · runnable · Verified 2026-07-15	Agents & Tool Use Prospective memory in agents	benchmark emerging · open	Whether agents remember and execute delayed intentions during an ongoing simulated week.	Testing assistants that must reliably follow up when future cues occur.	paper
ResearchClawBench · specialized · runnable · Verified 2026-07-15	Agents & Tool Use Autonomous research agents	benchmark emerging · open	Automated research agents on rediscovery and new-discovery workflows.	Comparing research-agent scaffolds rather than chat-model knowledge alone.	dataset results
runescape-bench / runebench · specialized · partial · Verified 2026-07-15	Agents & Tool Use Emerging / Agents	agent benchmark current · mixed	Game-world agent benchmark using RuneScape-like task environments.	Long-horizon game agents	repo results
SkillsBench · specialized · runnable · Verified 2026-07-15	Agents & Tool Use Reusable agent skills	benchmark emerging · open	Whether packaged skills improve agent performance across diverse task environments.	Evaluating skill libraries, harness design, and portable agent procedures.	dataset results
SOP-Bench · specialized · runnable · Verified 2026-07-15	Agents & Tool Use Industrial standard operating procedures	benchmark emerging · mixed	Agents on thousands of multi-step procedures across industrial domains.	Evaluating process adherence, completion, and tool accuracy in operations workflows.	repo paper
TheAgentCompany · watch · partial · Verified 2026-07-15	Agents & Tool Use Enterprise agents	agent benchmark emerging · mixed	Digital-worker benchmark involving browsing, coding, programs, and simulated coworkers.	Professional task automation	source
ToolBench / ToolLLM · reference · runnable · Verified 2026-07-15	Agents & Tool Use Tool use	benchmark legacy · open	API-use tasks for tool-augmented LLMs across large tool collections.	Tool-use research baselines	repo
ToolSandbox · recommended · runnable · Verified 2026-07-15	Agents & Tool Use Tool use	agent benchmark current · open	Stateful conversational tool-use tasks with tool execution and environment state.	Robust multi-turn tool agents	repo
TRAJECT-Bench · specialized · partial · Verified 2026-07-15	Agents & Tool Use Trajectory-aware tool use	benchmark emerging · open	Agent tool use with evaluation of the intermediate trajectory as well as the final answer.	Diagnosing how an agent arrived at an outcome, not only whether it succeeded.	source
VisualWebArena · recommended · runnable · Verified 2026-07-15	Agents & Tool Use Browser agents	agent benchmark current · open	Browser tasks requiring visual understanding of websites and UI state.	Multimodal web agents	repo
WebArena · recommended · runnable · Verified 2026-07-15	Agents & Tool Use Browser agents	agent benchmark current · open	Self-hosted realistic websites and natural-language browser tasks.	Browser automation agents	source repo
WildClawBench · specialized · runnable · Verified 2026-07-15	Agents & Tool Use Real-world autonomous agent work	benchmark emerging · open	Agent performance on diverse real-world tasks and environments.	Comparing general-purpose agent systems on operational work.	dataset results
WorkArena / WorkArena++ · recommended · runnable · Verified 2026-07-15	Agents & Tool Use Enterprise agents	agent benchmark current · open	Enterprise workflow tasks modeled in ServiceNow-style environments.	Business process agents	repo
YC-Bench · specialized · runnable · Verified 2026-07-15	Agents & Tool Use Startup CEO simulation	benchmark emerging · open	Agent decision-making across a simulated year of startup operations.	Exploring long-horizon business-agent behavior and trade-offs.	dataset results
Aider Polyglot · recommended · partial · Verified 2026-07-15	Coding & SWE Coding	benchmark current · mixed	Multi-language code editing and test-passing benchmark for practical coding assistants.	Code edit models across languages	leaderboard
ALE-Bench · specialized · runnable · Verified 2026-07-15	Coding & SWE Emerging / Coding	benchmark current · open	Long-horizon algorithm-engineering contest tasks.	Optimization and algorithm engineering agents	repo
BigCodeBench · recommended · runnable · Verified 2026-07-15	Coding & SWE Coding	benchmark current · open	Software-engineering-oriented code-generation tasks designed to go beyond HumanEval and MBPP.	Modern code-generation baselines	repo source
CursorBench · commercial · hosted · Verified 2026-07-15	Coding & SWE Emerging / Coding	benchmark current · commercial	Cursor's proprietary/internal offline eval suite from real Cursor sessions, focused on correctness, code quality, efficiency, and interaction behavior.	Editor-agent evaluation context	source
DeepSWE · watch · runnable · Verified 2026-07-15	Coding & SWE Emerging / SWE	agent benchmark emerging · open	Original long-horizon software engineering tasks across several languages with isolated environments and verifiers.	Emerging SWE agent evaluation	repo dataset results
GBA-Eval · specialized · partial · Verified 2026-07-15	Coding & SWE Emerging / Coding	benchmark current · mixed	Single high-quality long-horizon Game Boy Advance SWE eval case; useful signal, not a complete coding benchmark.	Experimental coding-agent signal	source post
HumanEval · reference · runnable · Verified 2026-07-15	Coding & SWE Coding	benchmark legacy · open	Classic Python function synthesis benchmark with unit tests.	Legacy coding baseline	repo
LiveCodeBench · recommended · runnable · Verified 2026-07-15	Coding & SWE Coding	benchmark current · open	Recent contest-style coding problems with contamination-aware releases.	Code-generation and algorithmic coding signal	repo
MBPP · reference · runnable · Verified 2026-07-15	Coding & SWE Coding	benchmark legacy · open	Mostly Basic Programming Problems for simple Python coding tasks.	Small-model and baseline coding checks	dataset
NVIDIA ComputeEval · specialized · runnable · Verified 2026-07-15	Coding & SWE CUDA correctness and performance	benchmark emerging · open	Correctness and runtime performance of generated GPU compute code.	Evaluating coding systems that optimize kernels or write CUDA.	dataset results
ProgramBench · specialized · runnable · Verified 2026-07-15	Coding & SWE Emerging / Coding	benchmark current · open	Rebuilding programs from compiled binaries and documentation.	Reverse engineering and deep coding	repo
SciCode · recommended · runnable · Verified 2026-07-15	Coding & SWE Science / Coding	benchmark current · open	Scientist-curated coding problems from real natural-science contexts.	Scientific coding agents	repo
SWE-bench · recommended · runnable · Verified 2026-07-15	Coding & SWE Coding / SWE	agent benchmark current · open	Real GitHub issues that require modifying repositories and passing tests.	Software engineering agent capability	source repo
SWE-bench Pro · commercial · hosted · Verified 2026-07-15	Coding & SWE Coding / SWE	agent benchmark current · commercial	Harder held-out/private-repo software engineering tasks intended to reduce contamination.	Frontier SWE agents and benchmark saturation checks	leaderboard paper dataset results
SWE-bench Verified · recommended · runnable · Verified 2026-07-15	Coding & SWE Coding / SWE	agent benchmark current · open	Human-filtered subset of SWE-bench with higher-quality real issue tasks.	Default SWE benchmark slice	dataset leaderboard results
SWE-Marathon · watch · partial · Verified 2026-07-15	Coding & SWE Emerging / SWE	agent benchmark emerging · mixed	Ultra-long-horizon software engineering tasks.	Long-horizon coding agents	source
Terminal-Bench 2.0 · recommended · partial · Verified 2026-07-15	Coding & SWE Coding / Agents	agent benchmark current · mixed	Terminal-based task execution benchmark for agents working in shell environments.	CLI agents and software task execution	leaderboard source dataset results
ArguAna · specialized · runnable · Verified 2026-07-15	Retrieval & RAG Counterargument retrieval	benchmark current · open	Retrieval of counterarguments for a given argumentative claim.	Diagnosing semantic retrieval beyond topical similarity.	dataset results
BEIR · recommended · runnable · Verified 2026-07-15	Retrieval & RAG Retrieval / RAG	benchmark current · open	Heterogeneous information-retrieval benchmark for zero-shot retrieval across many datasets and domains.	Retriever selection for RAG systems	repo paper
BRIGHT · specialized · runnable · Verified 2026-07-15	Retrieval & RAG Reasoning-intensive retrieval	benchmark emerging · open	Retrieval where finding relevant evidence requires multi-step reasoning.	Selecting retrievers for difficult research and professional search tasks.	dataset results
CRAG · recommended · runnable · Verified 2026-07-15	Retrieval & RAG RAG	benchmark current · open	Comprehensive RAG benchmark with factual QA and mock APIs for retrieval.	RAG factuality and retrieval stress tests	repo paper
DocVQA · recommended · partial · Verified 2026-07-15	Retrieval & RAG Document AI	benchmark current · mixed	Visual question answering over document images.	PDF, OCR, and document-agent evaluation	source
InfiniteBench · recommended · runnable · Verified 2026-07-15	Retrieval & RAG Long context	benchmark current · open	Super-long-context benchmark beyond 100k tokens.	Context-window stress testing	repo
LongBench / LongBench v2 · recommended · runnable · Verified 2026-07-15	Retrieval & RAG Long context	benchmark current · open	Long-context understanding and reasoning across documents and realistic multitask scenarios.	Long-context model screening	repo source
MDPBench · specialized · runnable · Verified 2026-07-15	Retrieval & RAG Multilingual document parsing	benchmark emerging · open	Real-world document parsing across languages, layouts, and content structures.	Comparing document intelligence pipelines serving multilingual corpora.	dataset results
MTEB · recommended · runnable · Verified 2026-07-15	Retrieval & RAG Retrieval / Embeddings	benchmark current · open	Massive Text Embedding Benchmark for comparing embedding models across retrieval, clustering, classification, reranking, and semantic similarity tasks.	Embedding and retrieval model selection	source repo
olmOCR-bench · specialized · runnable · Verified 2026-07-15	Retrieval & RAG PDF OCR and extraction	benchmark emerging · open	OCR fidelity across diverse PDF pages using thousands of document-level unit tests.	Choosing extraction components before evaluating downstream document RAG.	dataset results
OmniDocBench · recommended · runnable · Verified 2026-07-15	Retrieval & RAG Document AI	benchmark current · open	Document parsing benchmark for OCR, layout, table, formula, and reading-order extraction.	Document parsing for RAG pipelines	repo
ParseBench · specialized · runnable · Verified 2026-07-15	Retrieval & RAG Enterprise document parsing	benchmark emerging · open	Document parser accuracy on enterprise layouts and structured content.	Selecting a parser before measuring retrieval and grounded-answer quality.	dataset results
RAGBench · recommended · runnable · Verified 2026-07-15	Retrieval & RAG RAG	benchmark current · open	Explainable RAG benchmark across documents, retrieval, generation, and attribution.	RAG system evaluation	paper dataset
CRMArena · watch · partial · Verified 2026-07-15	Customer Support CX / CRM	agent benchmark emerging · mixed	CRM workflows for service agents, analysts, and business operations.	CRM and customer-ops agents	source
tau-bench · recommended · runnable · Verified 2026-07-15	Customer Support CX / Support	agent benchmark current · open	Customer-service agents in retail and airline domains using APIs and policy guidelines.	Support-agent reliability	source repo
tau-knowledge · watch · partial · Verified 2026-07-15	Customer Support CX / RAG	agent benchmark emerging · mixed	Knowledge-intensive support extension to the tau-bench family.	Support agents that retrieve policy knowledge	source
tau-voice · recommended · runnable · Verified 2026-07-15	Customer Support Voice / CX	agent benchmark current · open	Full-duplex voice customer-service tasks scored against final database state.	Voice support agents	source paper
tau2-bench / tau3-bench · recommended · runnable · Verified 2026-07-15	Customer Support CX / Support	agent benchmark current · open	Customer-service simulation framework with text, voice, policies, tools, multiple domains, and tau3 task-fix updates.	CX eval harness design	repo source tau3
Harvey BigLaw Bench · commercial · hosted · Verified 2026-07-15	Legal Legal	scoreboard current · commercial	Harvey benchmark context for BigLaw-style legal tasks; useful market signal but not an open runnable benchmark.	Legal AI market context	source
Harvey Legal Agent Benchmark / LAB · recommended · runnable · Verified 2026-07-15	Legal Legal	agent benchmark current · open	Open legal-agent benchmark with long-horizon tasks across practice areas and expert rubric criteria.	Agentic legal work product	repo source results
LawBench · recommended · runnable · Verified 2026-07-15	Legal Legal	benchmark current · open	Legal tasks across entity recognition, reading comprehension, legal consultation, and more.	Chinese/legal reasoning research	repo
Legal RAG Bench · specialized · runnable · Verified 2026-07-15	Legal End-to-end legal research RAG	benchmark emerging · open	End-to-end legal retrieval and reasoning on realistic research tasks.	Comparing legal RAG systems where sources and reasoning both matter.	post
LegalBench · recommended · runnable · Verified 2026-07-15	Legal Legal	benchmark current · open	162 legal reasoning tasks contributed by lawyers, law professors, researchers, and legal practitioners.	Legal reasoning baseline	source repo
LegalBench-RAG · recommended · runnable · Verified 2026-07-15	Legal Legal / RAG	benchmark current · open	Legal retrieval and generation benchmark for end-to-end legal RAG systems.	Legal document retrieval and grounded answers	repo paper
LEXam · specialized · runnable · Verified 2026-07-15	Legal Swiss and international legal exams	benchmark emerging · open	Legal reasoning on hundreds of Swiss, EU, and international law examination questions.	Comparing legal knowledge across European and international jurisdictions.	dataset results
LexGLUE · reference · runnable · Verified 2026-07-15	Legal Legal	benchmark legacy · open	Legal NLP benchmark suite in a SuperGLUE-like format.	Classic legal NLP tasks	repo
PLawBench · specialized · runnable · Verified 2026-07-15	Legal Real-world legal practice	benchmark emerging · open	Legal consultation, case analysis, and document generation with workflow-grounded rubrics.	Evaluating legal-practice outputs rather than multiple-choice legal knowledge.	paper repo
RedlineBench · specialized · runnable · Verified 2026-07-15	Legal Contract negotiation	benchmark emerging · open	Multi-turn contract redlining and negotiation behavior.	Evaluating legal agents that must preserve objectives across revisions.	dataset results
Vals AI LegalBench · commercial · hosted · Verified 2026-07-15	Legal Legal	scoreboard current · commercial	Hosted leaderboard for legal-model evaluation.	Current legal model comparison	leaderboard
EvasionBench · specialized · runnable · Verified 2026-07-15	Finance Evasive financial communication	benchmark emerging · open	Detection and handling of evasive answers in corporate earnings calls.	Evaluating financial-analysis systems that must distinguish disclosure from deflection.	dataset results
FinanceAgent / FAB v2 · commercial · hosted · Verified 2026-07-15	Finance Finance	agent benchmark current · commercial	Financial-agent benchmark for realistic analyst and finance workflow tasks.	Finance agents and analyst workflows	leaderboard
FinanceBench · recommended · runnable · Verified 2026-07-15	Finance Finance	benchmark current · open	Open-book financial QA over public company filings with evidence strings.	Financial analyst and SEC filing workflows	repo
FinBen / PIXIU · recommended · runnable · Verified 2026-07-15	Finance Finance	benchmark current · open	Broad financial benchmark suite across extraction, QA, generation, forecasting, and decision-making.	Finance model evaluation suites	repo paper
FinQA / ConvFinQA · recommended · runnable · Verified 2026-07-15	Finance Finance	benchmark current · open	Financial numerical reasoning over reports using structured and unstructured evidence.	Financial math and reasoning	source
GDPval · watch · partial · Verified 2026-07-15	Finance Finance / Professional work	benchmark emerging · mixed	Economically valuable professional tasks across domains including finance, insurance, and operations.	Professional work automation signal	source paper
Meta-Benchmarks for Financial Services · recommended · partial · Verified 2026-07-15	Finance Task-weighted benchmark aggregation	toolkit emerging · open	How 452 reported benchmarks map into 41 work activities and 38 banking business domains.	Designing evidence-weighted model selection for financial-services workflows.	paper
Open FinLLM Leaderboard · recommended · hosted · Verified 2026-07-15	Finance Finance	scoreboard current · hosted	Open leaderboard for financial LLM evaluation.	Financial model comparison	leaderboard
QFBench · watch · runnable · Verified 2026-07-15	Finance Finance	agent benchmark emerging · open	State-aware quantitative-finance agent tasks requiring code, market logic, and financial reasoning.	Quant finance agents	source repo
BioASQ recommendedpartial Verified 2026-07-15	Healthcare Biomedical	benchmark current · mixed	Biomedical semantic indexing and QA challenges.	Biomedical search and QA	source
CliBench watchpartial Verified 2026-07-15	Healthcare Healthcare	benchmark emerging · mixed	Clinical decisions on diagnoses, procedures, lab-test orders, and prescriptions with structured output ontologies.	Clinical decision granularity	source
ClinicBench recommendedrunnable Verified 2026-07-15	Healthcare Healthcare	benchmark current · open	Clinical language generation, understanding, and reasoning tasks including open-ended clinical decision-making.	Clinical decision support research	repo
HealthBench recommendedrunnable Verified 2026-07-15	Healthcare Healthcare	benchmark current · open	Realistic health conversations with physician-created rubrics.	Health assistant response quality	source repo dataset
HealthBench Professional watchpartial Verified 2026-07-15	Healthcare Healthcare	benchmark emerging · mixed	Professional healthcare benchmark variant focused on clinician workflows and higher-stakes medical tasks.	Clinical workflow response quality	paper source
MedAgentBench recommendedrunnable Verified 2026-07-15	Healthcare Healthcare / Agents	agent benchmark current · open	Virtual EHR/FHIR environment with clinically relevant tasks requiring agents to retrieve, record, order, and act in medical-record settings.	Clinical workflow agents	source repo paper
MedCalc-Bench recommendedrunnable Verified 2026-07-15	Healthcare Healthcare	benchmark current · open	Medical calculation tasks for evaluating whether LLMs can serve as clinical calculators.	Medical arithmetic and calculators	repo
MedHELM recommendedpartial Verified 2026-07-15	Healthcare Healthcare	harness current · mixed	HELM-style medical evaluation framework with a clinical taxonomy, healthcare task categories, and public/gated/private medical benchmarks.	Holistic medical evals	source site
MedMCQA referencerunnable Verified 2026-07-15	Healthcare Healthcare	benchmark legacy · open	Large-scale medical entrance-exam question set across healthcare subjects.	Medical multiple-choice baseline	dataset
MedQA referencerunnable Verified 2026-07-15	Healthcare Healthcare	benchmark legacy · open	Medical board-style QA for clinical knowledge and exam reasoning.	Medical QA baseline	repo
MultiMedQA recommendedpartial Verified 2026-07-15	Healthcare Healthcare	benchmark current · mixed	Composite medical QA benchmark spanning MedQA, MedMCQA, PubMedQA, LiveQA, MedicationQA, MMLU clinical topics, and HealthSearchQA.	Medical QA suite comparison	paper
OpenMed recommendedrunnable Verified 2026-07-15	Healthcare Healthcare	toolkit current · open	Open-source healthcare NLP toolkit, model hub, and curated clinical/biomedical resources; useful source layer rather than a benchmark leaderboard.	Medical NLP resources and PHI-safe extraction	repo source hub
PubMedQA recommendedrunnable Verified 2026-07-15	Healthcare Biomedical	benchmark current · open	Biomedical research QA requiring reasoning over PubMed abstracts.	Biomedical literature QA	source paper
VoxClinBench watchpartial Verified 2026-07-15	Healthcare Voice / Healthcare	benchmark emerging · mixed	Clinical voice benchmark with cross-lingual expansion.	Medical speech and clinical voice agents	dataset repo
χ-Bench specializedrunnable Verified 2026-07-15	Healthcare Policy-rich healthcare workflows	benchmark emerging · open	Long-horizon healthcare workflows with policy, process, and tool constraints.	Evaluating clinical operations agents beyond medical question answering.	dataset results
AIME 2026 specializedrunnable Verified 2026-07-15	Science & Reasoning Current competition mathematics	benchmark emerging · mixed	Advanced mathematical problem solving on the 2026 AIME contest set.	A dated, difficult math signal with explicit versioning.	dataset results
AIRS-Bench specializedrunnable Verified 2026-07-15	Science & Reasoning Autonomous AI research	benchmark emerging · open	End-to-end ML research ability on tasks derived from state-of-the-art papers.	Comparing AI research agents on executable experiment outcomes.	repo paper
ARC-AGI-2 recommendedpartial Verified 2026-07-15	Science & Reasoning General reasoning	benchmark current · mixed	Abstract visual reasoning tasks focused on generalization from few examples.	Abstract reasoning and generalization stress testing	source paper
BBH / BIG-Bench Hard referencerunnable Verified 2026-07-15	Science & Reasoning General reasoning	benchmark legacy · open	Hard tasks from BIG-bench covering symbolic, logical, and multi-step reasoning.	Legacy reasoning regression checks	repo
CritPt specializedrunnable Verified 2026-07-15	Science & Reasoning Physics	benchmark current · open	Research-level physics tasks from modern physics areas.	Niche frontier physics reasoning	repo
GPQA / GPQA Diamond recommendedrunnable Verified 2026-07-15	Science & Reasoning Science	benchmark current · open	Graduate-level science questions designed by domain experts and difficult for non-experts.	Expert science reasoning	repo paper dataset results
GSM8K referencerunnable Verified 2026-07-15	Science & Reasoning Grade-school math reasoning	benchmark legacy · open	Multi-step arithmetic reasoning on grade-school word problems.	Historical baselines and regression checks; it is increasingly saturated for frontier-model selection.	dataset results
HMMT February 2026 specializedrunnable Verified 2026-07-15	Science & Reasoning Current competition mathematics	benchmark emerging · mixed	Advanced competition mathematics on the February 2026 HMMT set.	A fresh math signal that complements AIME-style evaluation.	dataset results
Humanity's Last Exam recommendedrunnable Verified 2026-07-15	Science & Reasoning General reasoning	benchmark current · open	Expert-level closed-ended multimodal academic questions intended for frontier model differentiation.	Hard frontier academic reasoning	repo source dataset results
LAB-Bench / LABBench2 recommendedrunnable Verified 2026-07-15	Science & Reasoning Biology	benchmark current · open	Biology research tasks covering literature, protocols, databases, DNA/protein sequences, and lab reasoning; LABBench2 adds a newer dataset and harness.	Biology research assistants	paper dataset harness
LifeSciBench specializedhosted Verified 2026-07-15	Science & Reasoning Real-world life science research	benchmark emerging · mixed	Expert-written, expert-reviewed tasks grounded in practical life-science research.	Evaluating whether systems can support realistic research work beyond biology QA.	source
LiveBench recommendedrunnable Verified 2026-07-15	Science & Reasoning General reasoning	benchmark current · open	Fresh, contamination-conscious questions with objective scoring across reasoning, math, coding, language, and data analysis.	General model signal that resists stale benchmark gaming	source repo
MATH / AIME-style evals recommendedrunnable Verified 2026-07-15	Science & Reasoning Math	benchmark current · open	Competitive math problem solving and formal reasoning tasks.	Mathematical reasoning and launch-card comparison	repo
MLE-bench recommendedrunnable Verified 2026-07-15	Science & Reasoning ML research	agent benchmark current · open	Machine-learning engineering tasks drawn from Kaggle-style competitions.	ML engineering agents	repo paper
MMLU-Pro recommendedrunnable Verified 2026-07-15	Science & Reasoning General reasoning	benchmark current · open	Harder multiple-choice academic and professional knowledge benchmark derived from MMLU with more options.	Broad expert knowledge screening	repo paper dataset results
NanoFold Public specializedrunnable Verified 2026-07-15	Science & Reasoning Protein folding	benchmark emerging · open	Scientific-model performance on public protein-folding tasks.	Specialist life-science model comparisons with a public result feed.	dataset results
Pencil Puzzle Bench specializedrunnable Verified 2026-07-15	Science & Reasoning Reasoning	benchmark current · open	Deterministically verifiable constraint-satisfaction puzzle tasks.	Reasoning without LLM judges	repo
SciAgentArena specializedrunnable Verified 2026-07-15	Science & Reasoning Scientific research agents	benchmark emerging · open	Scientific agents on about 200 stepwise-verified tasks across five biomedical domains.	Comparing research-agent reliability, cost, and task-level scientific contribution.	site dataset paper
WeirdML specializedpartial Verified 2026-07-15	Science & Reasoning ML research	benchmark current · mixed	Unusual ML tasks designed to reward actual understanding over rote benchmark skill.	Anti-gaming ML reasoning	source
EVA-Bench watchpartial Verified 2026-07-15	Multimodal & Voice Voice	benchmark emerging · mixed	End-to-end voice-agent evaluation framework for realistic simulated conversations and voice-specific failure modes.	Voice-agent quality beyond transcripts	paper
MathVista recommendedrunnable Verified 2026-07-15	Multimodal & Voice Multimodal / Math	benchmark current · open	Mathematical reasoning over visual inputs.	Visual math reasoning	source
MM-BrowseComp watchpartial Verified 2026-07-15	Multimodal & Voice Multimodal / Browser agents	benchmark emerging · mixed	Emerging multimodal browsing benchmark for web tasks where visual context matters.	Multimodal browsing agents	paper
MMBench recommendedrunnable Verified 2026-07-15	Multimodal & Voice Multimodal	benchmark current · open	Broad multimodal model evaluation suite.	General VLM comparison	repo
MMMU / MMMU-Pro recommendedrunnable Verified 2026-07-15	Multimodal & Voice Multimodal	benchmark current · open	Expert multimodal reasoning where images materially affect the answer.	Vision-language reasoning	repo source dataset results
MultiVox specializedpartial Verified 2026-07-15	Multimodal & Voice Multimodal voice assistants	benchmark emerging · open	Voice assistants on spoken and visual cues including emotion, pitch, timbre, and ambient audio.	Evaluating omni assistants that must combine paralinguistic speech with images or video.	paper
Open ASR Leaderboard referencehosted Verified 2026-07-15	Multimodal & Voice Automatic speech recognition	scoreboard emerging · hosted	Speech-to-text systems across public ASR datasets and efficiency measures.	Shortlisting open speech-recognition models with source-linked results.	dataset results
PBench specializedrunnable Verified 2026-07-15	Multimodal & Voice Referring-expression segmentation	benchmark emerging · open	Pixel-level visual grounding from referring expressions.	Specialist comparison of multimodal grounding and segmentation systems.	dataset results
ScreenSpot-Pro specializedrunnable Verified 2026-07-15	Multimodal & Voice Professional GUI grounding	benchmark emerging · open	Visual grounding of interface elements in high-resolution professional software.	Choosing vision-language models for computer-use agents.	dataset results
Vaani Benchmark specializedrunnable Verified 2026-07-15	Multimodal & Voice Hindi automatic speech recognition	benchmark emerging · open	Hindi speech recognition across real acoustic and language conditions.	Selecting ASR systems for Hindi-language products.	dataset results
Video-MME recommendedrunnable Verified 2026-07-15	Multimodal & Voice Video	benchmark current · open	Video understanding benchmark across temporal and multimodal questions.	Video model comparison	source repo dataset results
VLABench specializedrunnable Verified 2026-07-15	Multimodal & Voice Vision-language-action robotics	benchmark emerging · open	Vision-language-action systems on primitive robotic manipulation tasks.	Comparing embodied models on reproducible task primitives.	dataset results
VocalBench watchrunnable Verified 2026-07-15	Multimodal & Voice Voice	benchmark emerging · open	Speech-interaction benchmark for vocal communication and multi-round voice tasks.	Speech interaction models	repo
VoiceAgentBench watchrunnable Verified 2026-07-15	Multimodal & Voice Voice / Agents	agent benchmark emerging · open	Speech-based agentic tasks with spoken queries, tool/function specifications, multi-turn dialogue, and safety cases.	Voice agents with tools	dataset paper
VoiceBench recommendedrunnable Verified 2026-07-15	Multimodal & Voice Voice	benchmark current · open	LLM-based voice assistant benchmark across speech QA, reasoning, instruction following, safety, and robustness.	Voice assistant model comparison	repo paper
WBench specializedrunnable Verified 2026-07-15	Multimodal & Voice Interactive video world models	benchmark emerging · open	Interactive video world models across multiple behavioral dimensions and metrics.	Comparing action-conditioned world models rather than passive video QA.	dataset results
AgentHarm restrictedrunnablehazardous Verified 2026-07-15	Safety & Security Safety / Agents	agent benchmark current · open	Harmful multi-step agent tasks for evaluating agent safety under tool-use settings.	Agent safety research	dataset paper
CyberSecEval / PurpleLlama recommendedrunnable Verified 2026-07-15	Safety & Security Security	benchmark current · open	Cybersecurity risk and capability evaluations for LLMs.	Defensive cyber-risk evaluation	repo
EVMbench restrictedrunnablehazardous Verified 2026-07-15	Safety & Security Security	agent benchmark current · open	Sandboxed smart-contract benchmark for detecting, patching, and exploiting EVM vulnerabilities.	Smart-contract security agents	source repo
ExploitBench restrictedpartialhazardous Verified 2026-07-15	Safety & Security Security	benchmark current · mixed	Capability ladder from vulnerability identification toward exploitation outcomes.	Cyber capability research with defensive caution	source
HarmBench restrictedrunnablehazardous Verified 2026-07-15	Safety & Security Safety	benchmark current · open	Robust refusal and red-teaming benchmark for evaluating whether models comply with harmful requests.	Safety refusal and red-team evaluation	source paper
MedSafetyBench restrictedrunnablehazardous Verified 2026-07-15	Safety & Security Healthcare / Safety	benchmark current · open	Medical safety benchmark for risky clinical responses and unsafe medical advice.	Medical safety evaluation	repo paper
OS-Harm restrictedrunnablehazardous Verified 2026-07-15	Safety & Security Safety / Computer use	agent benchmark current · open	Safety benchmark for computer-use agents and harmful action sequences.	Computer-use safety	repo paper
SafetyBench referencerunnable Verified 2026-07-15	Safety & Security Safety	benchmark legacy · open	Broad static safety benchmark for LLMs across safety categories.	Static safety baseline	source
SciRisk-Bench specializedpartialhazardous Verified 2026-07-15	Safety & Security AI-for-science risk	benchmark emerging · open	AI-for-science safety across explicit risk dimensions and scientific disciplines.	Mapping scientific-agent failure modes to concrete risk categories.	paper
SEC-bench restrictedrunnablehazardous Verified 2026-07-15	Safety & Security Security	agent benchmark current · open	Software-security tasks including vulnerability discovery, patching, and proof-of-concept style scenarios.	Security engineering agents	source repo
SOSBench specializedrunnablehazardous Verified 2026-07-15	Safety & Security Scientific misuse safety	benchmark emerging · open	Safety alignment on 3,000 regulation-grounded prompts across six high-risk scientific domains.	Testing refusal and policy behavior on knowledge-intensive scientific misuse.	site paper


Methodology & Provenance

How records are selected, labeled, and maintained

Editor: OpenNash Research. Last dataset verification: 2026-07-15.

Inclusion

A record needs a public, identifiable evaluation task or harness and at least one primary or operator source. Lists without inspectable methodology or source evidence are excluded.

Classification

One stable primary domain supports browsing. Secondary tags preserve the benchmark’s narrower subject. Maturity, access, risk, and runnability are separate fields.

Verification

Source URLs are checked for confirmed 404/410 responses. A verification date means the source path resolved on that date; it does not mean every score or claim was independently reproduced.
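
A minimal sketch of that link check, assuming an HTTP status code is already available for each source URL. Only a confirmed 404 or 410 counts as dead. Treating timeouts and other errors as "unconfirmed" is an assumption of this sketch, and the function and label names are hypothetical rather than OpenNash tooling.

```python
from typing import Optional

# Statuses treated as a confirmed dead link, per the Verification note.
DEAD_STATUSES = {404, 410}


def classify_link(status: Optional[int]) -> str:
    """Label one source URL from its HTTP status (None means no response).

    The three labels are hypothetical names chosen for this sketch.
    """
    if status is None:
        return "unconfirmed"  # timeout or network error: not proof of removal
    if status in DEAD_STATUSES:
        return "dead"
    if 200 <= status < 400:
        return "resolved"
    return "unconfirmed"  # e.g. 403 or 503: assumed to need a retry or manual look


# Illustrative statuses only; no real URLs are checked here.
statuses = {"a": 200, "b": 404, "c": 410, "d": 503, "e": None}
report = {key: classify_link(code) for key, code in statuses.items()}
print(report)
```

Keeping the classification separate from the network call makes the rule easy to test without touching any live source.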

Editorial labels

Recommended, specialized, reference, watch, restricted, and commercial are navigation aids. They are not benchmark scores, certifications, or paid placements.

Score selection

Every submitted configuration is retained. For comparison, the best reported configuration for each exact model ID and benchmark is selected and labeled; model variants are never silently merged.
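
The selection rule can be sketched as follows. This is an illustration under stated assumptions, not the Atlas pipeline: the `Result` fields, the `select_best` helper, and the model and benchmark names are all hypothetical, and metric direction is supplied by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model_id: str   # exact model ID; variants stay distinct keys
    benchmark: str
    config: str     # e.g. a reasoning setting or scaffold label
    score: float


def select_best(results, higher_is_better):
    """Return {(model_id, benchmark): Result} holding the best configuration.

    `higher_is_better` maps benchmark name -> bool (metric direction).
    The input list is left untouched, so every configuration is retained.
    """
    best = {}
    for r in results:
        key = (r.model_id, r.benchmark)
        current = best.get(key)
        if current is None:
            best[key] = r
            continue
        sign = 1 if higher_is_better[r.benchmark] else -1
        if sign * r.score > sign * current.score:
            best[key] = r
    return best


# Made-up rows for illustration only.
results = [
    Result("model-a", "bench-acc", "default", 71.0),
    Result("model-a", "bench-acc", "high-reasoning", 74.5),
    Result("model-a-mini", "bench-acc", "default", 66.0),
    Result("model-a", "bench-wer", "default", 8.2),
    Result("model-a", "bench-wer", "tuned", 7.9),
]
direction = {"bench-acc": True, "bench-wer": False}
best = select_best(results, direction)
for (model_id, benchmark), r in sorted(best.items()):
    print(model_id, benchmark, r.config, r.score)
```

Because the key is the exact model ID, `model-a` and `model-a-mini` never collapse into one row.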

Normalization

Raw metrics are not averaged. Results become average-rank percentiles inside each benchmark, with metric direction respected and tied values sharing a percentile.
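
One way to compute average-rank percentiles is sketched below. The exact percentile formula is not stated in this section, so the mid-rank convention used here, 100 × (rank − 0.5) / n, is an assumption, as are the function name and sample scores.

```python
def rank_percentiles(scores, higher_is_better=True):
    """Map {model_id: raw score} to {model_id: percentile}.

    Ranks run from 1 (worst) to n (best). Tied scores share the mean of
    the ranks they span. Percentile = 100 * (rank - 0.5) / n, which is
    one common convention and an assumption of this sketch.
    """
    n = len(scores)
    sign = 1 if higher_is_better else -1
    ordered = sorted(scores.items(), key=lambda kv: sign * kv[1])
    percentiles = {}
    i = 0
    while i < n:
        j = i
        # Extend j over the run of models tied with position i.
        while j + 1 < n and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        avg_rank = (i + 1 + j + 1) / 2  # mean of 1-based ranks i+1 .. j+1
        for model_id, _ in ordered[i : j + 1]:
            percentiles[model_id] = 100 * (avg_rank - 0.5) / n
        i = j + 1
    return percentiles


# Made-up scores: accuracy (higher is better) and error rate (lower is better).
accuracy = rank_percentiles({"m1": 0.9, "m2": 0.8, "m3": 0.8, "m4": 0.5})
error = rank_percentiles({"m1": 0.1, "m2": 0.3}, higher_is_better=False)
print(accuracy)  # m2 and m3 are tied and share one percentile
print(error)     # m1 has the lower error, so it ranks higher
```

Working in ranks keeps an accuracy in percent and an error rate in fractions on the same 0 to 100 scale without averaging the raw numbers.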

Benchmark health

Each benchmark receives a transparent weight from participation, observed discrimination, dataset freshness, and submission-source quality. A benchmark needs results from at least three models to be weighted.
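
The weighting formula itself is not given in this section. The sketch below shows only the shape of such a rule: four factors, each assumed to be pre-scaled to the range 0 to 1, combined by an equal-weight mean (an assumption), with the three-model minimum applied first. All names are hypothetical.

```python
from dataclasses import dataclass

MIN_MODELS = 3  # the three-model minimum stated in the methodology


@dataclass
class HealthInputs:
    n_models: int          # models with a selected result on this benchmark
    participation: float   # each factor is assumed pre-scaled to [0, 1]
    discrimination: float
    freshness: float
    source_quality: float


def health_weight(h: HealthInputs) -> float:
    """Return a weight in [0, 1]; 0.0 means the benchmark is not scored."""
    if h.n_models < MIN_MODELS:
        return 0.0
    factors = [h.participation, h.discrimination, h.freshness, h.source_quality]
    if any(not 0.0 <= f <= 1.0 for f in factors):
        raise ValueError("factors must be scaled to [0, 1]")
    return sum(factors) / len(factors)  # equal-weight mean: an assumption


too_small = health_weight(HealthInputs(2, 1.0, 1.0, 1.0, 1.0))
healthy = health_weight(HealthInputs(5, 1.0, 0.5, 0.5, 1.0))
print(too_small, healthy)
```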

Use-case aggregation

Selected percentiles are combined with benchmark-health weights. Coverage and confidence are shown separately; neither is hidden inside the performance score.
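
A sketch of the combination step, assuming a weighted mean over the benchmarks a model actually has results on. Coverage is returned as its own value rather than folded into the score. Confidence is left out because this section does not define how it is computed, and the function and benchmark names are hypothetical.

```python
def aggregate(percentiles, weights, use_case_benchmarks):
    """Combine one model's percentiles for a use case.

    percentiles: {benchmark: percentile} where the model has a result
    weights: {benchmark: health weight}
    use_case_benchmarks: benchmarks that count for this use case
    Returns (score or None, coverage); coverage is the share of usable
    weight that the model's results account for.
    """
    usable = [b for b in use_case_benchmarks if weights.get(b, 0.0) > 0.0]
    covered = [b for b in usable if b in percentiles]
    total_weight = sum(weights[b] for b in usable)
    covered_weight = sum(weights[b] for b in covered)
    coverage = covered_weight / total_weight if total_weight else 0.0
    if not covered_weight:
        return None, coverage  # no evidence: report no score, not zero
    score = sum(weights[b] * percentiles[b] for b in covered) / covered_weight
    return score, coverage


# Made-up weights and percentiles; the model has no result on b3.
weights = {"b1": 0.5, "b2": 1.0, "b3": 0.5}
model_percentiles = {"b1": 80.0, "b2": 50.0}
score, coverage = aggregate(model_percentiles, weights, ["b1", "b2", "b3"])
print(score, coverage)
```

Returning `None` for a model with no covered benchmarks avoids presenting missing evidence as a low score.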

Updates

The catalog is reviewed monthly, the official score snapshot is refreshed at least monthly, and source links are checked weekly. Material taxonomy or source changes appear in the changelog below.

Limitations

Reported systems may use different tools, prompts, reasoning settings, scaffolds, budgets, and graders. Public tasks may be contaminated. This is shortlisting evidence, not deployment proof.

Disclosure

OpenNash sells evaluation and agent engineering services. No benchmark pays for inclusion or placement. The commercial relationship is stated so readers can weigh the guidance.

Licensing

Benchmark names and descriptions are provided for reference. Each linked benchmark retains its own license and terms; verify them at the source before use.

Changelog

2026-07-16

Expanded the July 2026 census with the Hugging Face official benchmark registry and current agent, scientific, legal, voice, safety, and meta-evaluation releases; added a provenance-preserving score snapshot and evidence-weighted model chooser.

2026-07-15

Normalized domain, maturity, access, risk, and runnability fields; repaired confirmed dead links; added static rendering, category hubs, and stable exports.

2026-06-07

Published the initial benchmark atlas.

What is the minimum viable eval setup?

Review 20–50 real outputs after a meaningful change, group failures into a taxonomy, and turn the most frequent high-impact failures into binary checks. Expand toward roughly 100 fresh traces or continue until new traces stop revealing important failure types.
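
As an illustration, the review-to-check loop might look like this. The review records, failure labels, and the `[source:` citation marker are made up for the sketch; real taxonomies come from reading your own outputs.

```python
from collections import Counter

# Hypothetical review notes: one entry per reviewed output.
# `failure` is None when the output was acceptable.
reviews = [
    {"id": 1, "failure": "missing_citation", "high_impact": True},
    {"id": 2, "failure": None, "high_impact": False},
    {"id": 3, "failure": "wrong_tool_call", "high_impact": True},
    {"id": 4, "failure": "missing_citation", "high_impact": True},
    {"id": 5, "failure": "verbose_tone", "high_impact": False},
    {"id": 6, "failure": "missing_citation", "high_impact": True},
    {"id": 7, "failure": "wrong_tool_call", "high_impact": True},
    {"id": 8, "failure": "verbose_tone", "high_impact": False},
]


def top_failures(reviews, k=2):
    """Count high-impact failures by label and return the k most frequent."""
    counts = Counter(
        r["failure"] for r in reviews if r["failure"] and r["high_impact"]
    )
    return counts.most_common(k)


def has_citation(output: str) -> bool:
    """Binary check for the 'missing_citation' failure: pass or fail only.

    The '[source:' marker is an assumed citation format for this sketch.
    """
    return "[source:" in output


print(top_failures(reviews))
print(has_citation("Revenue rose 4% [source: 10-K p.12]."))
```

Each frequent, high-impact label gets one check like `has_citation`, which then runs on every new trace.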

Should teams use public benchmarks or private evals?

Use public benchmarks to shortlist models, learn task formats, and find runnable harnesses. Use private evals to prove that your system works on your traces, policies, documents, tools, permissions, and business outcomes.

When should I use an LLM-as-judge?

Use deterministic checks first: schema validation, exact match, tool-result assertions, citations, policy thresholds, or execution tests. Use a model judge when the important failure is subjective and recurring, then validate it against human labels.
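
Both halves of that advice can be sketched in a few lines. The first function runs deterministic checks on a JSON response; the required `answer` and `citations` fields are a hypothetical schema. The second computes Cohen's kappa, one standard agreement statistic, between a judge's pass/fail verdicts and human labels. Choosing kappa over another measure is an assumption of this sketch.

```python
import json
from typing import List


def deterministic_checks(raw: str, expected_answer: str) -> List[str]:
    """Return the names of failed checks; an empty list means all passed."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return ["valid_json"]
    # Schema check against the hypothetical fields used in this sketch.
    if (
        not isinstance(payload, dict)
        or not isinstance(payload.get("answer"), str)
        or not isinstance(payload.get("citations"), list)
    ):
        return ["schema"]
    failures = []
    if payload["answer"].strip() != expected_answer:
        failures.append("exact_match")
    if not payload["citations"]:
        failures.append("has_citation")
    return failures


def cohen_kappa(judge: List[bool], human: List[bool]) -> float:
    """Chance-corrected agreement between judge verdicts and human labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1 - expected)


print(deterministic_checks('{"answer": "42", "citations": ["doc1"]}', "42"))
print(deterministic_checks('{"answer": "41", "citations": []}', "42"))
print(cohen_kappa([True, True, False, False], [True, False, False, False]))
```

A judge whose kappa against human labels stays low is measuring something other than the failure you care about, so fix the judge prompt or rubric before trusting its scores.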

Is this a complete list of AI benchmarks?

It is a dated July 2026 census under a public inclusion policy, not a claim to capture every private, unpublished, or newly released benchmark. The catalog currently covers 159 source-linked records; the scored layer is narrower because only benchmarks with inspectable model results can be aggregated.

How are scores from different AI benchmarks aggregated?

Raw metrics are never averaged directly. Each result becomes a within-benchmark percentile, benchmark weights reflect participation, discrimination, freshness, and source quality, and the selected use case combines those percentiles. Coverage and confidence remain separate from performance.

Which AI model is best for my use case?

There is no context-free best model. Choose the closest work preset, require evidence from multiple benchmarks, exclude irrelevant tests, inspect source provenance, then validate the shortlist on private tasks that match your tools, data, policies, latency, and cost constraints.

How often should production evals run?

Run offline gates before release, sample live traces weekly, and perform new error analysis after model swaps, prompt or tool changes, incidents, complaint spikes, or metric drift.

Need an eval stack for your system?

OpenNash helps teams turn real traces and failure modes into offline release gates, online monitoring, human review, and durable production controls.

