A private equity operating partner once told me his portfolio company had found the perfect first AI project: an agent that would read inbound RFPs and draft tailored responses. High visibility, board-level excitement, a clean revenue story. Six weeks in, the project was quietly dead. The RFPs lived in three different inboxes, half were scanned PDFs, and nobody could agree on what a good response looked like. There was no fast way to check whether the agent was right, so every draft got a full human rewrite anyway.

Two floors down, the finance team had automated something nobody bragged about: matching incoming invoices to purchase orders. It shipped in nine days and has run ever since.

The gap was not model quality or budget. Both teams had the same tools. The difference was workflow selection. The finance team picked a task where the data was clean, the answer was checkable, and a mistake cost almost nothing to undo. The RFP team picked the task with the biggest slide in the board deck. This post is the scorecard I wish the RFP team had used before they started.

Start where the proof is clean, not where the spotlight is

The instinct to automate the highest-value workflow first is understandable and almost always wrong. High-value workflows tend to be high-value precisely because they are hard, judgment-heavy, and consequential. Those are the same traits that make them dangerous to hand to a new system you do not yet trust.

The failure data backs this up. Gartner predicted that at least 30% of generative AI projects would be abandoned after proof of concept by the end of 2025, citing poor data quality and unclear business value among the causes. A RAND study on why AI projects fail found that they fail at roughly double the rate of IT projects without AI, and the top causes were rarely the model. They were misunderstood problems, bad data, and teams chasing the latest technology instead of solving a defined task.

Anthropic's engineering team makes the same point from the build side. Their guidance on building effective agents opens by telling you to find the simplest solution that works and to add agentic complexity only when a simpler workflow falls short. The most successful implementations they describe relied on simple, composable patterns, not on the most elaborate setups.

So the first question is not "where would AI create the most value?" It is "where can AI create real value while the risk stays small and the proof stays clean?" Those are different questions, and the scorecard exists to keep them separate.

This is the natural next step after the earlier work in this series. Once you have mapped the process, found the system of record, and measured a baseline, you usually have three or four viable candidates. The scorecard turns that shortlist into a ranked decision instead of a hallway debate.

The eight axes of an AI workflow readiness scorecard

Score every candidate on the same eight axes, each from 1 to 5, where 5 is the most favorable. Four axes measure upside and feasibility. Four measure how much trouble a mistake can cause. Keeping both groups visible stops a high-value number from hiding a fatal risk number.

| Axis | What a 5 looks like | What a 1 looks like |
| --- | --- | --- |
| Business value | Clear time or revenue impact, named owner asking for it | Speculative value, nobody waiting on it |
| Process clarity | Documented steps, mostly deterministic rules | Tribal knowledge, judgment varies by person |
| Data availability | Clean, structured data in one system of record | Scattered across inboxes, PDFs, and spreadsheets |
| Exception handling | Few, predictable edge cases | Long tail of one-off exceptions |
| Risk containment | Small blast radius if the output is wrong | A single error can lose a deal or breach a rule |
| Reversibility | A bad action is trivial to undo | Mistakes are hard or impossible to reverse |
| Human review cost | An output can be verified in seconds | Verifying takes as long as doing it by hand |
| Integration simplicity | One or two systems with solid APIs | Many brittle systems, manual handoffs |

The first four axes are familiar. Most teams already argue about business value and data quality. The second four are where projects actually die, and they get far less attention before the work starts.

A quick note on the data axis. Data problems are the single most common reason a promising workflow scores low here. If the inputs are not already sitting in a system you can query, you are signing up to build a data pipeline before you build an agent, and that pipeline is the real project.

How to score a candidate workflow

Add the eight axes for a total out of 40. That total gives you a quick ranking. But a sum alone is dangerous, because it lets a strong workflow average away a deadly weakness. So apply two hard gates on top of the total.

  • Gate 1 (risk containment): any candidate scoring 1 on risk containment is out, no matter how high the total. You do not want your first project to be the one where a wrong output costs you a client or a clause.
  • Gate 2 (reversibility): any candidate scoring 1 on reversibility is out for autonomous action. It can still be a fine first project if you keep a human in the approval loop, but it cannot run unattended.

With the gates in place, read the total like this:

| Total (of 40) | Reading |
| --- | --- |
| 32 to 40 | Strong first project. Build it. |
| 26 to 31 | Viable with guardrails. Add human review on the weak axes. |
| 20 to 25 | Not yet. Fix the lowest axis first, usually data or process clarity. |
| Below 20 | Wrong workflow for now. Revisit after the foundations improve. |
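The total, the two gates, and the reading bands can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `evaluate`, the axis keys, and the wording of the gate-failure message are assumptions made for the example.

```python
# Hypothetical axis keys, in the same order as the scorecard table.
AXES = (
    "business_value",
    "process_clarity",
    "data_availability",
    "exception_handling",
    "risk_containment",
    "reversibility",
    "human_review_cost",
    "integration_simplicity",
)


def evaluate(scores: dict) -> dict:
    """Return the total, both gate results, and the band reading."""
    missing = set(AXES) - set(scores)
    if missing:
        raise ValueError(f"missing axes: {sorted(missing)}")
    for axis in AXES:
        if not 1 <= scores[axis] <= 5:
            raise ValueError(f"{axis} must be between 1 and 5")

    total = sum(scores[axis] for axis in AXES)

    # Gate 1: a 1 on risk containment removes the candidate outright.
    gated_out = scores["risk_containment"] == 1
    # Gate 2: a 1 on reversibility only rules out unattended operation.
    needs_human_approval = scores["reversibility"] == 1

    if gated_out:
        # Assumed wording; the post only says the candidate is "out".
        reading = "Out. Fails the risk containment gate."
    elif total >= 32:
        reading = "Strong first project. Build it."
    elif total >= 26:
        reading = "Viable with guardrails. Add human review on the weak axes."
    elif total >= 20:
        reading = "Not yet. Fix the lowest axis first."
    else:
        reading = "Wrong workflow for now."

    return {
        "total": total,
        "gated_out": gated_out,
        "needs_human_approval": needs_human_approval,
        "reading": reading,
    }


if __name__ == "__main__":
    # Made-up candidate that scores 4 on every axis.
    print(evaluate({axis: 4 for axis in AXES}))
```

The gates are checked before the bands on purpose, so a high total can never mask a 1 on risk containment.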

The point of the gates is to make a specific failure mode impossible: shipping the impressive workflow that nobody can verify. A 33 total with a checkable output beats a 35 total where every result needs a manual audit. The scorecard should make that trade explicit instead of leaving it to whoever is most excited in the room.

One caution on the weighting. Resist the urge to multiply business value by some large factor to "let the important work win." That defeats the purpose. The scorecard exists to protect you from your own enthusiasm. If you want to weight anything, weight human review cost, because it compounds on every single run.

The two axes teams skip: reversibility and human review cost

Almost everyone scores business value and data availability. Almost nobody scores reversibility and human review cost before they start. These two decide whether a pilot survives its first bad week.

Reversibility is about blast radius over time. An agent that drafts an email a human sends is highly reversible, because the human is the undo button. An agent that posts journal entries to a closed accounting period or sends outreach to your entire customer list is not. Simon Willison's writing on the lethal trifecta for AI agents is the sharpest version of this warning: when an agent combines access to private data, exposure to untrusted content, and the ability to communicate externally, a single malicious instruction can trick it into sending that data to an attacker. Low reversibility plus that trifecta is the combination to keep out of a first project entirely.

Human review cost is the axis with the worst return on neglect. The question is simple: when the agent produces an output, how long does it take a person to know whether it is right? If the answer is "a few seconds, because the match is either correct or it is not," you have a workflow that scales. If the answer is "about as long as doing it from scratch," you have automated nothing. You have added a review step.

This is the practical heart of why evaluation matters so much. Hamel Husain's work on why your AI product needs evals makes the case that you cannot improve what you cannot judge, and that error analysis is where most of the real engineering time goes. A workflow with cheap, fast verification gives you a built-in eval loop for free. A workflow where correctness is expensive to assess starves that loop from day one. Score the verification cost honestly, because it sets the ceiling on everything else.

A useful test: imagine the agent runs 100 times tonight. Tomorrow morning, how long does it take one person to confirm those 100 outputs were acceptable? If the honest answer is more than a few minutes, the human review cost score is low, and you should know that before you write a line of code.
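The 100-run test is simple arithmetic, and writing it down makes the gap between a cheap check and an expensive one hard to ignore. The sketch below is illustrative only: the function name `review_minutes` and the per-output timings (3 seconds, 10 minutes) are made-up assumptions, not measurements from any project.

```python
def review_minutes(runs: int, seconds_per_output: float) -> float:
    """Total minutes one reviewer needs to check every output."""
    return runs * seconds_per_output / 60


if __name__ == "__main__":
    # Assumed timings: a right-or-wrong match takes about 3 seconds to
    # confirm, while a long draft takes about 10 minutes to read.
    quick_check = review_minutes(100, 3)
    full_read = review_minutes(100, 600)
    print(f"Quick check: {quick_check:.0f} minutes for 100 outputs")
    print(f"Full read:   {full_read / 60:.1f} hours for 100 outputs")
```

Under those assumed timings the first workflow costs a reviewer five minutes each morning and the second costs more than two working days.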

Scoring three real candidates

Here is the scorecard applied to a real mid-market shortlist. A PE-backed services company had three workflows on the table and a board that wanted "something with AI" by the next quarter.

| Axis | RFP response agent | Invoice-to-PO matching | Contract clause extraction |
| --- | --- | --- | --- |
| Business value | 5 | 3 | 5 |
| Process clarity | 2 | 5 | 3 |
| Data availability | 2 | 5 | 3 |
| Exception handling | 2 | 4 | 2 |
| Risk containment | 3 | 4 | 1 |
| Reversibility | 4 | 5 | 2 |
| Human review cost | 1 | 5 | 2 |
| Integration simplicity | 2 | 4 | 3 |
| Total | 21 | 35 | 21 |

The two workflows with the highest business value, RFP responses and contract clause extraction, tied for last at 21. The boring one won at 35 and cleared every gate.

Look at why. The RFP agent scored a 1 on human review cost, because there was no fast way to tell a good draft from a mediocre one, so every output needed a full read. Contract clause extraction scored a 1 on risk containment and got gated out entirely: a missed indemnity clause during diligence is the kind of error that ends a deal, and you do not test that for the first time on a live transaction. The same logic shows up in serious treatments of AI in private equity due diligence, where AI accelerates document review but a human still owns the legal conclusion.
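The totals and gate outcomes for this shortlist can be reproduced directly from the table. The snippet below is a small sketch for checking the arithmetic; the scores come from the table, while the function name `summarize` and the status labels are assumptions made for the example.

```python
AXES = [
    "Business value",
    "Process clarity",
    "Data availability",
    "Exception handling",
    "Risk containment",
    "Reversibility",
    "Human review cost",
    "Integration simplicity",
]

# Scores copied from the table, in the same axis order as AXES.
CANDIDATES = {
    "RFP response agent": [5, 2, 2, 2, 3, 4, 1, 2],
    "Invoice-to-PO matching": [3, 5, 5, 4, 4, 5, 5, 4],
    "Contract clause extraction": [5, 3, 3, 2, 1, 2, 2, 3],
}

RISK = AXES.index("Risk containment")
REVERSIBILITY = AXES.index("Reversibility")


def summarize(candidates: dict) -> list:
    """Return (name, total, gate status) tuples, highest total first."""
    rows = []
    for name, scores in candidates.items():
        if scores[RISK] == 1:
            status = "gated out (risk containment)"
        elif scores[REVERSIBILITY] == 1:
            status = "human approval required"
        else:
            status = "clears both gates"
        rows.append((name, sum(scores), status))
    # sorted() is stable, so tied totals keep their table order.
    return sorted(rows, key=lambda row: -row[1])


if __name__ == "__main__":
    for name, total, status in summarize(CANDIDATES):
        print(f"{name:<28}{total:>3}  {status}")
```

Note that the RFP agent passes both gates and is held back only by its total of 21, which lands in the "not yet" band, while contract clause extraction is removed by the risk containment gate regardless of its total.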

Invoice matching looked unglamorous on the scorecard and looked even less glamorous on the slide. But the data sat in one ERP, every match was right or wrong in a way anyone could check in seconds, and a mistake was a single line item to correct. That company shipped it, built trust with the finance team and the board, and used the same plumbing to tackle a harder workflow two quarters later. The scorecard did not tell them to avoid the hard problems. It told them what order to do them in.

This pattern holds across the mid-market AI adoption data: the companies seeing real returns are not the ones with the most ambitious first projects. They are the ones that picked a contained workflow, proved the operating model, and expanded from a position of trust.

How OpenNash Can Help

If you have a shortlist and a board that wants a decision, the scorecard is the fastest way to turn the argument into a ranking everyone can read. OpenNash runs this as the front end of every engagement: we audit your operations, score the candidate workflows on these eight axes, and pick the first build where value is real and risk is contained.

From there the model is consistent. We design the guardrails, human approvals, and failure handling before deployment, build and test against a checkable proof source, and hand over a production system you fully own, including the CI/CD and documentation. We will also tell you plainly when the right first move is to fix a data foundation rather than build an agent, and when a packaged tool beats a custom one for your situation.

If you want to score your own shortlist against this rubric, book a call to map this to your workflow. Bring three candidates. We will help you find the one where the proof is clean and the downside is small, which is almost never the one on the slide.