What should I measure before implementing AI automation?

Five things per workflow: cycle time (request to completion), volume (units per week), exception rate (percent needing escalation or rework), fully loaded cost per unit, and quality (error or rework rate). These five numbers form the baseline that any AI business case gets judged against.

How long does it take to baseline a workflow before AI?

One to two weeks for a single workflow if your system of record has timestamps. Pull 20-50 recent work items, measure elapsed time and touch time, count exceptions, and calculate loaded cost. Avoid multi-month measurement projects; a defensible sample beats a perfect census.

What is a good exception rate baseline for automation?

Most back-office workflows run 15-40% exceptions when first measured, which surprises teams who guessed 5%. The exception rate matters more than the average case because exceptions determine how much human review your AI deployment will still need.

How do you calculate cost per unit for a manual workflow?

Take fully loaded hourly cost (salary plus benefits and overhead, typically 1.25-1.4x base salary), multiply by touch time per unit, then add a share of rework and escalation cost. A 20-minute invoice touched by a $45/hour loaded analyst costs roughly $15 before exceptions.

Why do AI automation ROI claims fail without a baseline?

Because after deployment, memories of the old process drift, and people overstate or understate how it actually performed. Finance asks what changed, and the team can only offer anecdotes. A baseline captured before launch converts the same question into arithmetic: cycle time went from 3.2 days to 4 hours, exceptions from 28% to 11%.

The AI Workflow Baseline: Measure Cycle Time, Volume, Exceptions, Cost, and Quality Before You Build

A portfolio company CFO asked me a question last quarter that killed a six-figure automation proposal in one sentence: "What does this process cost us today?" Nobody in the room knew. Not the ops lead who lived in the workflow daily, not the consultant pitching the agent, not the deal team sponsor. The proposal was full of projected savings, and not one of those projections had a denominator.

This happens constantly, and it is the most fixable failure in AI adoption. Bain's Private Equity Midyear Report 2026 notes that firms are pushing portfolio companies hard on AI-driven operating improvements, but the gap between firms that capture value and firms that capture slideware comes down to measurement discipline. Cherry Bekaert's 2026 PE outlook makes a similar point: value creation plans now assume AI efficiencies, which means someone has to prove those efficiencies happened.

You prove them with a baseline. Before any AI touches a workflow, you need five numbers: cycle time, volume, exception rate, cost, and quality. This post covers how to get each one in a week or two, how to assemble them into a scorecard, and how the same scorecard becomes your launch gate and your monthly operating review.

Why the Baseline Is the Business Case

An AI automation business case has exactly one honest structure: here is what the workflow costs and delivers today, here is what we expect after, here is the delta, here is what the build costs. Every term in that equation except the build cost depends on the baseline. Skip it and you are not making a business case, you are making a vibe case.

There is a second reason that matters more in a PE context. Mid-market portfolio companies live and die by the operating review. If your AI deployment cannot show up in the monthly pack as a before/after on metrics the deal team already tracks, it will be invisible at exactly the moment you need budget for the next workflow. McKinsey's research on capturing value from gen AI keeps finding the same pattern: organizations that report bottom-line impact from AI are the ones that wired measurement into the deployment from day one, not the ones that retrofitted metrics after the fact.

And there is a third reason that is purely defensive. After launch, the old process evaporates from memory. People will tell you the manual version took "about a day" when your timestamps say 3.4 days. They will tell you "we almost never made mistakes" when the rework queue says 6%. Without a baseline, every post-launch argument about whether the AI helped is a memory contest. With one, it is arithmetic.

The takeaway: the baseline is not a measurement chore you do before the real work. It is the business case, the launch gate, and the operating metric, captured once.

The Five Baseline Questions

For each workflow you are considering automating, answer five questions. Pull 20-50 recent completed work items from your system of record (the workflow discovery and system-of-record mapping you did earlier tells you where those live) and measure against real items, not recollections.

1. Cycle time: how long from request to done?

Measure two different clocks:

Elapsed time: timestamp of the triggering event (email received, ticket opened, invoice arrived) to timestamp of completion. This is what customers and internal stakeholders feel.
Touch time: actual human working minutes on the item. This is what drives cost.

The gap between the two is queue time, and it is usually enormous. A vendor onboarding that takes 11 days elapsed often contains 90 minutes of touch time. That gap is your automation opportunity statement: AI agents collapse queues because they do not wait for someone to get back from lunch, and they make the touch time cheaper.

Report cycle time as a distribution, not an average. Median plus 90th percentile is the minimum. The 90th percentile is where escalations, angry emails, and churn live.
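If your system of record exports timestamps, both clocks and the distribution take only a few lines to compute. The sketch below is one way to do it in Python. The sample rows, the (opened, completed, touch minutes) layout, and the function names are made up for illustration, and the percentile uses the standard library's "inclusive" interpolation, which is one of several reasonable conventions.

```python
from datetime import datetime
from statistics import median, quantiles

FMT = "%Y-%m-%d %H:%M"  # assumed timestamp format of the export


def elapsed_hours(opened: str, completed: str) -> float:
    """Wall-clock hours between the triggering event and completion."""
    delta = datetime.strptime(completed, FMT) - datetime.strptime(opened, FMT)
    return delta.total_seconds() / 3600


def p90(values: list) -> float:
    """90th percentile: the last of the nine decile cut points."""
    return quantiles(values, n=10, method="inclusive")[-1]


def cycle_time_summary(items: list) -> dict:
    """items: (opened, completed, touch_minutes) per completed work item."""
    elapsed = [elapsed_hours(o, c) for o, c, _ in items]
    touch_hours = [t / 60 for _, _, t in items]
    return {
        "median_elapsed_h": median(elapsed),
        "p90_elapsed_h": p90(elapsed),
        "median_touch_h": median(touch_hours),
        # Share of total elapsed time the items spent waiting in a queue.
        "queue_share": 1 - sum(touch_hours) / sum(elapsed),
    }


# Made-up sample rows, for illustration only.
sample = [
    ("2026-03-02 09:00", "2026-03-02 19:00", 20),
    ("2026-03-02 10:00", "2026-03-03 06:00", 25),
    ("2026-03-03 08:00", "2026-03-04 14:00", 15),
    ("2026-03-03 09:00", "2026-03-05 01:00", 30),
    ("2026-03-04 11:00", "2026-03-06 13:00", 30),
]
print(cycle_time_summary(sample))
```

On these five rows the median is 30 hours and the p90 is 46 hours, while touch time is under 2% of elapsed time. With a 20-50 item sample the p90 is a rough estimate, so treat it as a signal about the tail rather than a precise figure.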

2. Volume: how many units, and how lumpy?

Units per week is the headline number, but capture the shape too: month-end spikes in finance ops, Monday floods in support, seasonal RFQ waves. Volume does two jobs in the business case. It is the multiplier on per-unit savings, and it determines whether automation is even worth it. A workflow that takes 45 minutes of touch time but happens six times a month is a checklist problem, not an AI problem.

3. Exception rate: what percent breaks the happy path?

Count items that needed escalation, a second pair of hands, missing-information chase-downs, or rework. This is the number teams are worst at guessing. Almost everyone estimates single digits; measured reality in mid-market back offices is typically 15-40%.

The exception rate is the most important number on the whole scorecard, for a blunt reason: it sets the ceiling on autonomy. If 30% of invoices need a human judgment call today, your AI deployment will need a review lane for roughly that share on day one, and your cost model has to include it. Vendors who quote savings assuming 100% straight-through processing are quoting a workflow that does not exist.

While you count exceptions, categorize them. Five or six categories usually cover 80% of cases (missing PO number, price mismatch, new vendor, unusual terms). Each category is a future design decision: automate, route to human, or fix upstream.
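Counting and categorizing can happen in one pass over the sample. Here is a minimal sketch, assuming each sampled item has been tagged by hand with an exception category, or with None if it stayed on the happy path. The tagging scheme, the function name, and the sample counts are hypothetical.

```python
from collections import Counter
from typing import Optional


def exception_summary(categories: list):
    """categories: one entry per sampled item (Optional[str]); None means no exception.

    Returns the exception rate and a list of
    (category, count, cumulative share of all exceptions), largest first.
    """
    exceptions = [c for c in categories if c is not None]
    rate = len(exceptions) / len(categories)
    breakdown, running = [], 0
    for name, count in Counter(exceptions).most_common():
        running += count
        breakdown.append((name, count, running / len(exceptions)))
    return rate, breakdown


# Made-up 40-item sample: 28 clean items and 12 exceptions.
sample: list = (
    [None] * 28
    + ["missing PO number"] * 5
    + ["price mismatch"] * 3
    + ["new vendor"] * 2
    + ["unusual terms"]
    + ["other"]
)

rate, breakdown = exception_summary(sample)
print(f"exception rate: {rate:.0%}")
for name, count, cumulative in breakdown:
    print(f"{name:<20} {count:>3}  cumulative {cumulative:.0%}")
```

The cumulative column shows how few categories cover most of the exceptions, which is the shortlist for the automate, route, or fix-upstream decisions.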

4. Cost: what does one unit cost, fully loaded?

The formula is short: fully loaded hourly cost × touch time per unit, plus an allocation for exceptions and rework. Fully loaded means salary plus benefits, payroll tax, and overhead, typically 1.25-1.4x base. An AP analyst at $70k base is roughly $45/hour loaded (about $91k a year at 1.3x, spread over roughly 2,080 paid hours); if an invoice takes 20 touch minutes, that is $15 per clean invoice, and exceptions often double it.
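The same arithmetic as a short Python sketch. The 1.3 load factor is simply a midpoint of the 1.25-1.4x range, and the 2,080 paid hours per year (40 hours × 52 weeks) is an assumption; swap in your own figures. The function names are illustrative.

```python
def loaded_hourly_rate(base_salary: float,
                       load_factor: float = 1.3,       # assumed midpoint of 1.25-1.4x
                       hours_per_year: float = 2080):  # assumed 40 h x 52 weeks
    """Fully loaded cost of one working hour."""
    return base_salary * load_factor / hours_per_year


def cost_per_unit(hourly_rate: float, touch_minutes: float) -> float:
    """Labor cost of one clean unit, before exceptions and rework."""
    return hourly_rate * touch_minutes / 60


def blended_cost(clean_cost: float, exception_cost: float,
                 exception_rate: float) -> float:
    """Average cost per unit once exceptions are weighted in."""
    return (1 - exception_rate) * clean_cost + exception_rate * exception_cost


rate = loaded_hourly_rate(70_000)
clean = cost_per_unit(rate, touch_minutes=20)
# "Exceptions often double it": assume an exception costs 2x a clean unit.
blended = blended_cost(clean, 2 * clean, exception_rate=0.28)
print(f"loaded rate ${rate:.2f}/h, clean ${clean:.2f}, blended ${blended:.2f}")
```

With these assumptions the loaded rate comes out at $43.75 an hour, close to the rounded $45 used above. The blended figure is the one to carry into the business case, because it already includes the exception share.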

Resist the temptation to count "soft savings" like freed-up time at full value. CFOs discount that line to near zero unless headcount, overtime, or contractor spend actually moves. Anchor the hard case on measurable cost, and let cycle-time and quality improvements carry the strategic argument.

5. Quality: how often is the output wrong?

This is the baseline almost nobody captures, and it produces the most useful surprise in the whole exercise: humans are not as accurate as everyone assumes. Sample 30-50 completed items and audit them. Typical findings: 3-8% of manually processed invoices have a coding or amount error, 10-20% of support replies miss part of the customer's question, onboarding packets ship with missing fields regularly.

This number reframes the entire AI conversation. The question is never "can the AI be perfect," it is "can the AI beat 95% with cheaper review." That is the same logic Hamel Husain applies to evaluating LLM systems: you cannot judge a model's outputs without first doing error analysis on real examples and defining what failure means in your context. Baselining human quality is error analysis on the incumbent system, and the rubric you build to audit humans becomes the eval rubric you will use to grade the AI.

The Baseline Scorecard

Put all five numbers for each candidate workflow on one page. Here is the format, with realistic mid-market figures for five common workflows:

Workflow	Cycle time (median / p90)	Volume /wk	Exception rate	Cost per unit	Quality (error rate)
Invoice processing (finance ops)	3.2 days / 9 days	410	28%	$14 clean / $31 exception	4.5% coding errors
Tier-1 support replies	6.5 hrs / 2.1 days	620	22% escalated	$9	14% incomplete answers
Customer onboarding	11 days / 24 days	35	37% missing info	$310	9% packets with defects
RFQ intake and quoting	2.5 days / 6 days	55	31% need engineering input	$95	6% pricing errors
Monthly reporting prep	4 days (fixed)	12 reports/mo	18% require restatement	$480/report	1-2 material corrections/mo

Two notes on using the table. First, it instantly ranks your automation pipeline: volume times cost per unit gives annual spend per workflow, and exception rate tells you implementation difficulty. Invoice processing at 410/week and $14-31 per unit is a ~$400k annual process; that funds a serious build. Second, the scorecard surfaces the workflows you should not automate yet, which is just as valuable. A 37% exception rate on onboarding usually means the upstream intake form is broken; fix the form first, then automate the leaner process. APQC's process benchmarking data is useful for sanity-checking whether your numbers are merely bad or unusually bad; an AP cost per invoice 3x the benchmark median is a different conversation than one near it.
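The ranking in the first note is plain multiplication, sketched here in Python with the scorecard figures. Two assumptions are baked in: volume holds for 52 weeks a year, and the invoice row is blended as 72% clean items at $14 and 28% exceptions at $31. The other rows use the single cost per unit from the table.

```python
WEEKS_PER_YEAR = 52  # assumption: weekly volume holds year-round

# (workflow, units per year, cost per unit, exception rate), from the scorecard.
scorecard = [
    ("Invoice processing", 410 * WEEKS_PER_YEAR, 0.72 * 14 + 0.28 * 31, 0.28),
    ("Tier-1 support replies", 620 * WEEKS_PER_YEAR, 9.0, 0.22),
    ("Customer onboarding", 35 * WEEKS_PER_YEAR, 310.0, 0.37),
    ("RFQ intake and quoting", 55 * WEEKS_PER_YEAR, 95.0, 0.31),
    ("Monthly reporting prep", 12 * 12, 480.0, 0.18),  # 12 reports/month
]


def annual_spend(units_per_year: float, cost_per_unit: float) -> float:
    """Annual spend on a workflow: volume times cost per unit."""
    return units_per_year * cost_per_unit


# Largest annual spend first; exception rate is printed as a difficulty signal.
ranked = sorted(scorecard, key=lambda r: annual_spend(r[1], r[2]), reverse=True)
for name, units, cost, exc in ranked:
    print(f"{name:<26} ${annual_spend(units, cost):>10,.0f}/yr  exceptions {exc:.0%}")
```

Invoice processing lands just under $400k a year. Customer onboarding actually ranks first on spend, at roughly $564k, which is why its 37% exception rate matters: the money is there, but the upstream fix comes before the build.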

The takeaway: one page, one row per workflow, five metric columns. If a workflow cannot fill its row, it is not ready for an automation decision.

From Baseline to Launch Metrics to Operating Metrics

Here is where the baseline pays for itself three times. The same five metrics serve three distinct phases, and you should write all three down before building anything.

Phase 1: Launch criteria. Before the agent goes live, define pass/fail against the baseline. Example for invoice processing: quality on a 100-invoice eval set must be at or above 95.5% (matching or beating the human 4.5% error rate), exceptions must route to the review queue with zero silent failures, and cycle time on the straight-through path must be under 4 hours. This is the same idea as a deployment gate in software delivery; the DORA research program showed years ago that teams who define delivery metrics up front outperform teams who argue about them after incidents, and AI deployments behave the same way.
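Writing the launch criteria as an executable check keeps them from softening later. A minimal sketch of the invoice example in Python follows; the EvalResult fields and the launch_gate function are hypothetical names, and which cycle-time statistic you gate on (median, p90, or maximum) is a choice the team has to make explicitly.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    items_evaluated: int
    items_correct: int
    exceptions_total: int
    exceptions_routed_to_review: int
    straight_through_cycle_hours: float  # statistic of your choice, e.g. p90


def launch_gate(result: EvalResult,
                baseline_error_rate: float = 0.045,
                max_cycle_hours: float = 4.0) -> dict:
    """Pass/fail per criterion, measured against the human baseline."""
    accuracy = result.items_correct / result.items_evaluated
    return {
        # Must match or beat the human accuracy implied by the baseline.
        "quality": accuracy >= 1 - baseline_error_rate,
        # Every exception must reach the review queue: zero silent failures.
        "no_silent_failures": (
            result.exceptions_routed_to_review == result.exceptions_total
        ),
        "cycle_time": result.straight_through_cycle_hours < max_cycle_hours,
    }


checks = launch_gate(EvalResult(100, 96, 28, 28, 3.5))
print(checks, "-> launch" if all(checks.values()) else "-> hold")
```

On a 100-invoice eval set the 95.5% bar means at least 96 correct, so 95 correct fails the gate even though it sounds close.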

Phase 2: Ramp targets. Weeks one through eight. You are not expecting full savings yet because review coverage starts high. Track straight-through rate weekly (expect it to climb from maybe 40% to 70%+ as you tune exception categories), and track quality on every reviewed item, since review labor is free eval data.

Phase 3: Monthly operating metrics. The five baseline numbers, re-measured monthly, sitting next to their pre-AI values in the operating pack. Cycle time 3.2 days then, 5 hours now. Cost per unit $14 then, $4.10 now including review labor and model spend. This is the slide that gets the second workflow funded, and in a PE context it is the slide that survives diligence at exit, when a buyer's advisors ask whether the EBITDA improvement attributed to AI is real. Moonfare's 2026 private equity outlook flags AI-driven operational value as a theme buyers will increasingly test rather than take on faith; a metrics trail that predates the deployment is the cleanest possible answer.

One discipline note: keep measuring cost honestly after launch. Post-AI cost per unit includes model and infrastructure spend, the review lane, and maintenance time. Teams that quietly drop those from the denominator get caught, usually by the same CFO who asked the original question.
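To make the full denominator concrete, here is a hedged Python sketch of post-AI cost per unit. Every input below is a made-up placeholder, not a figure from a real deployment, and the function name is illustrative. The point is only which terms belong in the sum.

```python
def post_ai_cost_per_unit(units: int,
                          model_and_infra: float,
                          review_share: float,
                          review_minutes: float,
                          reviewer_hourly: float,
                          maintenance_hours: float,
                          maintainer_hourly: float) -> float:
    """Monthly cost per unit including model spend, review lane, and upkeep."""
    review_labor = units * review_share * review_minutes / 60 * reviewer_hourly
    maintenance = maintenance_hours * maintainer_hourly
    return (model_and_infra + review_labor + maintenance) / units


# Hypothetical month: 1,800 invoices, 30% routed to a 6-minute human review.
honest = post_ai_cost_per_unit(
    units=1800,
    model_and_infra=1500.0,
    review_share=0.30,
    review_minutes=6,
    reviewer_hourly=45.0,
    maintenance_hours=10,
    maintainer_hourly=90.0,
)
model_only = 1500.0 / 1800  # what the number looks like if review and upkeep vanish
print(f"honest ${honest:.2f} per unit vs model-only ${model_only:.2f}")
```

In this made-up month the review lane is the largest cost line, larger than model spend, which is why dropping it changes the story so much.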

Where Baselining Goes Wrong

Four failure modes show up repeatedly:

The census trap. Someone proposes instrumenting everything for a quarter before deciding. Measurement becomes the project and momentum dies. A 30-item sample with timestamps from the system of record is accurate enough to act on and takes days, not months. You are sizing a decision, not publishing research.
Interview-only baselines. Asking people how long things take produces answers off by 2-5x in both directions, with no malice involved. People report touch time when you ask about elapsed time and forget exceptions entirely. Use timestamps; interview only to explain what the timestamps show.
Averages without distributions. An average cycle time of 2 days hides the 15% of items that take 10. The tail is where customer pain and escalation cost concentrate, and it is often where AI helps most, because agents do not let items rot in queues.
Baseline theater. Numbers get collected, a slide gets made, and nobody defines the launch criteria or schedules the monthly re-measure. The baseline only compounds if it becomes the standing instrument. Put the re-measure on the calendar before kickoff.

How OpenNash Can Help

Baselining is the front half of how OpenNash runs every engagement. Our audit phase produces exactly the scorecard described above: we pull the work items, build the timestamps, categorize the exceptions, and audit quality with a rubric, before anyone debates models or architecture. That rubric then becomes the eval harness for the agent we build, the launch criteria are agreed with your team in writing, and the monthly operating metrics ship as part of the deployment, not as an afterthought. Clients own all of it, including the measurement pipeline, after handoff.

If you already have an internal team that can build, you may only need help with the measurement and eval design, and that is a fine engagement shape too. If your workflows are low-volume or your exception rates point to broken upstream processes, the honest advice is to fix those first, and we will say so.

Book a call to map the five-question baseline to your highest-spend workflow.

Start With One Workflow This Week

Pick the workflow your team complains about most. Pull the last 30 completed items from the system of record. Fill in one row of the scorecard: median and p90 cycle time, weekly volume, exception count with categories, loaded cost per unit, and an audited error rate. It takes one analyst a few days.

Then look at the row. Either it makes the automation case for you, with numbers a CFO will accept, or it tells you the process needs cleanup before AI will stick. Both answers are worth more than any vendor projection, and you will have them before the next operating review.