Your AI agent passes every eval. It handles edge cases. It costs less per query than the intern it replaced. And nobody uses it.
This is the most common failure mode we see at OpenNash, and it has nothing to do with model quality. A procurement team at a mid-size retailer we worked with had an agent that could process vendor invoices with 94% accuracy. After three months, adoption was at 12%. The team had reverted to spreadsheets. When we asked why, the answer was always some version of the same thing: "I don't trust what it's doing."
The gap between a working agent and a used agent is almost entirely a UX problem. Chip Huyen nailed this when she wrote that going from 0 to 60 is easy, but going from 60 to 100 is exceedingly hard. That last 40% is not about making the model smarter. It is about making the interface trustworthy.
Why Model Quality Is Not Your Adoption Problem
Here is an uncomfortable truth for engineering teams: users do not evaluate agents the way benchmarks do. An agent that produces one wrong answer with no explanation will be trusted less than one that is right only 95% of the time but clearly shows its work.
The NNGroup State of UX 2026 report makes this explicit - the design challenge has shifted from "can AI do the task" to "can the user understand and trust what AI did." The technical capability gap is closing. The trust gap is widening.
Google's People + AI Research (PAIR) guidelines frame this around mental models. Users need to build an accurate mental model of what the agent can and cannot do. When the interface gives no signals about capability boundaries, users either over-trust (dangerous) or under-trust (wasteful). Both outcomes kill adoption.
This matters for the business case. You can spend six months improving your agent's accuracy from 91% to 96%. Or you can spend two weeks redesigning how it communicates what it is doing. The second option will move your adoption numbers more. Every time.
Progressive Disclosure: How Much Reasoning to Show
The instinct when users say "I don't trust it" is to show everything. Full chain-of-thought. Every tool call. The complete reasoning trace. This is almost always wrong.
Most users do not want to read your agent's reasoning. They want to know three things: what did it do, how confident is it, and can I verify it if I need to. Progressive disclosure gives them exactly that, layered by interest.
Here is the pattern that works in production:
Layer 1 - The Result. Show the output with a confidence indicator. Not a percentage (users do not know what "87% confident" means in context). Use human-readable signals: a green checkmark for high confidence, a yellow flag for "you should review this," a red stop sign for "I could not complete this."
Layer 2 - The Summary. One click deeper, show a plain-English explanation of what the agent did. "I found 3 matching invoices, cross-referenced them against the purchase order, and flagged a $200 discrepancy on line item 4." This is the layer most users actually read.
Layer 3 - The Audit Trail. For power users and compliance, expose the full trace. Which tools were called, what data was accessed, what the model's reasoning was at each step. Most users never open this. But knowing it exists builds trust.
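One way to make the three layers concrete is to have the agent return a structured result that the UI can render at each depth, rather than a bare string. This is a minimal sketch with hypothetical names (`AgentResult`, `TraceStep`, and so on are illustrative, not a specific framework's API):

```python
from dataclasses import dataclass, field
from enum import Enum

class Confidence(Enum):
    HIGH = "high"      # green checkmark: safe to accept
    REVIEW = "review"  # yellow flag: "you should review this"
    FAILED = "failed"  # red stop sign: "I could not complete this"

@dataclass
class TraceStep:
    tool: str        # which tool was called
    inputs: dict     # what data was accessed
    rationale: str   # the model's reasoning at this step

@dataclass
class AgentResult:
    # Layer 1: the output plus a human-readable confidence signal
    output: str
    confidence: Confidence
    # Layer 2: plain-English summary of what the agent did
    summary: str
    # Layer 3: full audit trail, collapsed by default in the UI
    trace: list[TraceStep] = field(default_factory=list)

result = AgentResult(
    output="Invoice #4411 approved with one flag",
    confidence=Confidence.REVIEW,
    summary=("I found 3 matching invoices, cross-referenced them against "
             "the purchase order, and flagged a $200 discrepancy on line item 4."),
    trace=[TraceStep("invoice_search", {"vendor": "Acme"},
                     "Matched vendor name and PO number")],
)
```

The point of the structure is that the UI decides how much to show; the agent always produces all three layers, so the audit trail exists even when nobody opens it.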
Microsoft's HAX (Human-AI Experience) Toolkit formalizes this as "make clear how well the system can do what it can do." The key insight is that transparency is not about volume of information - it is about the right information at the right moment.
The procurement agent we mentioned earlier? We added Layer 1 and Layer 2. Adoption went from 12% to 67% in six weeks. We did not change a single model parameter.
Five Transparency Patterns for Agent Interfaces
Not all agents need the same type of transparency. After building agent interfaces across a dozen client deployments, we have landed on five patterns that map to different use cases.
| Pattern | Best For | Example |
|---|---|---|
| Confidence Badges | Classification and routing tasks | "High confidence" / "Needs review" tags on each output |
| Step Timeline | Multi-step workflows | Visual pipeline showing which steps completed and which are pending |
| Source Attribution | Research and analysis agents | Inline citations with expandable source previews |
| Decision Comparison | Recommendation agents | "I chose Option A over Option B because..." with a comparison table |
| Boundary Signals | Any agent with defined scope | "This is outside what I can do. Here is who to contact instead." |
Confidence Badges are the simplest to implement and the highest-impact for most teams. The trick is calibration. If your agent says "high confidence" and is wrong more than 5% of the time, you have destroyed the signal. Research on AI UX trends from Veza Digital points to calibrated confidence as one of the top differentiators between AI products that retain users and those that do not.
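Calibration is checkable from your own eval logs. Here is a small sketch (the function name and record format are illustrative) that groups past predictions by badge and flags any badge whose error rate exceeds the 5% rule of thumb above:

```python
from collections import defaultdict

def badge_error_rates(records, threshold=0.05):
    """Flag miscalibrated badges from historical (badge, was_correct) pairs.

    A badge is considered calibrated if its observed error rate is at
    or below `threshold` (5% here, per the rule of thumb in the text).
    """
    totals, errors = defaultdict(int), defaultdict(int)
    for badge, was_correct in records:
        totals[badge] += 1
        if not was_correct:
            errors[badge] += 1
    return {
        badge: {
            "error_rate": errors[badge] / totals[badge],
            "calibrated": errors[badge] / totals[badge] <= threshold,
        }
        for badge in totals
    }

# Synthetic example: "high" badge wrong 3% of the time, "review" wrong 40%
records = ([("high", True)] * 97 + [("high", False)] * 3
           + [("review", True)] * 60 + [("review", False)] * 40)
report = badge_error_rates(records)
```

In this example "high" passes at a 3% error rate, and "review" fails badly, which is fine: "review" is supposed to mean "I am not sure." The badge that must be calibrated is the one users are told to trust without checking.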
Source Attribution matters more than most teams realize. When an agent says "your Q3 revenue was $4.2M," users want to click through and see the source document. This is table stakes for any agent operating on business data. Linear's engineering team has written about how exposing AI reasoning in their product increased feature adoption precisely because users could verify claims against their own data.
Boundary Signals are the most underused pattern. An agent that clearly says "I cannot do this" is more trustworthy than one that tries and produces garbage. UX and AI in 2026 from CleverIT Group puts it well - the shift from experimentation to trust requires designing for what the system cannot do, not just what it can.
Chat Is Not Always the Answer
Every agent demo is a chatbot. Type a question, get an answer. It makes for great demos and terrible products.
Chat interfaces work when the interaction is exploratory, open-ended, and conversational. They are a poor fit for the majority of actual business agent use cases, which are structured, repetitive, and task-oriented.
Consider an agent that processes expense reports. A chat interface means the user types "process my expense report" and then answers a series of questions: "Which report? What date range? Should I flag items over $500?" This is objectively worse than a form with dropdowns and a "Run" button.
Here are the four interface patterns we use, and when each one fits:
Task-Based Interface. A structured form or dashboard where the user configures inputs and triggers the agent. Best for: invoice processing, data enrichment, report generation. The user does not need to "talk" to the agent. They need to point it at a job and press go.
Ambient Interface. The agent runs in the background and surfaces results through notifications, badges, or dashboard widgets. Best for: monitoring, anomaly detection, lead scoring. The user does not interact with the agent at all unless something needs attention.
Inline Interface. The agent appears within the user's existing workflow - a suggestion bar in a document, a recommendation panel in a CRM, auto-complete in a form field. Best for: writing assistance, data entry, code completion. The agent is a feature, not a product.
Conversational Interface. The traditional chat pattern. Best for: open-ended research, complex multi-turn troubleshooting, creative brainstorming. This is maybe 20-30% of real-world agent use cases.
Jakob Nielsen's usability heuristics - first published in 1994 and still the foundation of good interface design - include "match between system and real world." An expense report is not a conversation. A monitoring alert is not a conversation. Forcing everything into chat because it is the default AI interface is a category error.
The business impact is direct. One of our clients switched their CRM enrichment agent from a chat interface to an inline panel that showed suggestions directly on the contact record. Time-to-completion dropped from 3 minutes per contact to 40 seconds. Usage went up 4x.
Designing Error States That Build Trust
Here is the counterintuitive part: the moments when your agent fails are your best opportunities to build trust.
Most agent interfaces handle errors one of two ways. Either the agent confidently gives a wrong answer (the black-box failure), or it returns a generic "Something went wrong" message (the dead-end failure). Both destroy trust. The first because the user feels deceived. The second because the user feels abandoned.
Good error design follows three principles:
1. Admit uncertainty before it becomes an error. If the agent is not confident, say so before presenting the result. "I found a likely match, but the names differ slightly - please verify" is infinitely better than presenting the wrong match as fact. Akraya's UX research trends report emphasizes that users rate systems with honest uncertainty signals as more competent, not less - even when the uncertainty means more work for the user.
2. Provide a fallback path, not just an error message. "I could not find a matching vendor" is a dead end. "I could not find a matching vendor. Here are the three closest matches, or you can enter the details manually" keeps the user moving. Every error state should answer the question: "OK, so what do I do now?"
3. Make errors learnable. If the agent failed because the user's input was ambiguous, show why. "Your query 'fix the Johnson account' matched 4 accounts. Next time, include the account number." This turns a failure into a training moment and reduces future errors.
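All three principles can live in the shape of the error object itself. This is a hedged sketch using the vendor-matching example from principle 2 (`AgentError` and `find_vendor` are hypothetical names, not a real library's API):

```python
from dataclasses import dataclass, field

@dataclass
class AgentError:
    what_happened: str                                          # admit the failure plainly
    fallback_options: list[str] = field(default_factory=list)   # principle 2: a path forward
    how_to_avoid: str = ""                                      # principle 3: make it learnable

def find_vendor(query: str, vendors: list[str]):
    """Return an exact match, or a structured error with next steps."""
    matches = [v for v in vendors if query.lower() in v.lower()]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        return AgentError(
            what_happened=f"I could not find a vendor matching '{query}'.",
            fallback_options=["Enter vendor details manually",
                              "Browse the full vendor list"],
        )
    # Ambiguous input: show closest matches and teach the user to disambiguate
    return AgentError(
        what_happened=f"'{query}' matched {len(matches)} vendors.",
        fallback_options=matches[:3],
        how_to_avoid="Next time, include the vendor ID to disambiguate.",
    )

err = find_vendor("acme", ["Acme East", "Acme West", "Zenith Supply"])
```

Because every failure path returns options rather than a bare message, the UI can never render a dead end: there is always an answer to "OK, so what do I do now?"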
The underlying principle is what Google's PAIR guidelines call "graceful failure" - the system should fail in ways that preserve the user's ability to complete their task, even if the agent cannot help.
We built an agent for a legal team that reviews contracts for compliance issues. Early versions flagged problems but did not explain why. The lawyers ignored it - not because it was wrong, but because they could not tell whether it was right. We added a "reasoning" panel that showed which specific clause triggered the flag and which regulation it mapped to. The same agent, same accuracy, but now the lawyers actually used it because they could evaluate its judgment.
The Trust Stack: A Framework for Agent Interface Design
If you are designing an agent interface right now, here is the framework we use internally. We call it the Trust Stack because each layer builds on the one below it.
Layer 1: Predictability. The agent should behave consistently. Same input, same type of output. Users should never be surprised by the format or scope of what comes back. This is the foundation - without it, nothing else matters.
Layer 2: Transparency. Use progressive disclosure (described above) to make the agent's process visible. Not all of it, all the time. Just enough that the user can build an accurate mental model.
Layer 3: Control. Give users the ability to override, correct, and configure the agent. Undo buttons. Edit capabilities on agent outputs. Settings for how aggressive or conservative the agent should be. Control is not just functional - it is psychological. Users who feel in control trust more, even if they never use the controls.
Layer 4: Accountability. Audit trails, version history, and clear attribution of which actions the agent took versus which actions the user took. This matters for compliance, but it also matters for trust. When something goes wrong, the user needs to understand what happened and why.
Layer 5: Improvement. The agent should visibly get better based on user feedback. When a user corrects an output, the next similar output should reflect that correction. This closes the loop and transforms the relationship from "tool I use" to "system that learns from me."
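Layer 4 in particular is cheap to build if you do it from day one. A minimal sketch of an append-only log with explicit agent-versus-user attribution (the class and method names are illustrative):

```python
import datetime
from dataclasses import dataclass

@dataclass(frozen=True)
class AuditEntry:
    actor: str       # "agent" or "user" -- clear attribution (Layer 4)
    action: str
    timestamp: datetime.datetime

class AuditLog:
    """Append-only record of who did what, for accountability."""

    def __init__(self):
        self._entries: list[AuditEntry] = []

    def record(self, actor: str, action: str):
        assert actor in ("agent", "user"), "attribution must be explicit"
        self._entries.append(AuditEntry(
            actor, action, datetime.datetime.now(datetime.timezone.utc)))

    def by_actor(self, actor: str) -> list[AuditEntry]:
        return [e for e in self._entries if e.actor == actor]

log = AuditLog()
log.record("agent", "Flagged $200 discrepancy on invoice #4411")
log.record("user", "Overrode flag: discrepancy is an approved surcharge")
```

Note that the user's override in the last line is also the raw material for Layer 5: corrections you have attributed and timestamped are exactly the feedback a learning loop needs.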
Most teams start at Layer 2 (transparency) because it feels like the obvious trust problem. But if Layer 1 (predictability) is broken - if the agent sometimes returns a table and sometimes returns a paragraph for the same type of query - no amount of transparency will help.
Start at the bottom. Get predictability right. Then work your way up.
The real competitive advantage here is not building a smarter agent. It is building an agent that feels trustworthy enough for people to change their workflow around it. That is a design problem, not a model problem. And right now, almost nobody is treating it that way.