
How to Score an Agentic AI Pilot: The 90-Day Evaluation Framework

Most agentic AI pilots end one of two ways: quietly abandoned after the excitement fades, or scaled on the strength of a good demo and a hopeful sponsor. Both failure modes share a root cause — there was never a scorecard. Here is the 100-point framework that turns the 90-day keep, kill, or scale decision into a defensible one.

Kognitos 13 min read

TL;DR

An agentic AI pilot should be scored at the 90-day mark on five weighted dimensions totaling 100 points: Outcome Integrity (30), Exception Economics (25), Audit Defensibility (20), Operational Fit (15), and Scale Readiness (10). A pilot scoring 75 or above is ready to scale. A pilot scoring 50 to 74 needs a defined fix-and-recheck cycle, not a scale decision. A pilot scoring below 50 should be stopped or rebuilt, regardless of how impressive the demo was.

The reason most pilots are scored badly is that they are measured on activity (how many transactions the AI touched) rather than on outcome integrity (whether those transactions were correct, explainable, and audit-defensible). A pilot that processed 10,000 invoices at a 70% touchless rate looks successful until you learn that nobody verified the 7,000 auto-approved decisions and the audit trail cannot reconstruct why any of them happened. That is not a successful pilot. It is an unmeasured liability.

This framework scores the things that predict whether a pilot will survive production and scale: whether the AI’s outputs are correct and explainable, whether exception handling is economically sustainable at volume, whether the audit trail satisfies 2026 regulatory standards (COSO February 2026, PCAOB AS 2201, EU AI Act Article 11), whether the workflow fits how the team actually operates, and whether the architecture can extend to the second and third use case. The scorecard, the thresholds, the four metrics that matter most, and the 30/60/90 checkpoint structure are below.

This post covers the scoring decision for a pilot that is already running. For the questions to ask vendors before you buy, see The Agentic AI RFP Template. For the broader multi-year program, see How Enterprise Leaders Build a Long-Term AI Automation Strategy That Scales.

Why agentic AI pilots need a scoring framework

The MIT Project NANDA study (July 2025) found that 95% of enterprise generative AI pilots deliver zero measurable P&L impact. The number gets quoted as evidence that the technology is overhyped. That is the wrong lesson. The technology works in the 5% of cases where it is deployed against the right workflow and measured properly. The 95% failure rate is largely a measurement and selection failure, not a technology failure.

Pilots fail to convert to production for four recurring reasons, and a scoring framework catches all four before the scale decision:

The pilot was measured on activity, not outcome integrity. It processed a lot of transactions. Nobody checked whether the outputs were correct or explainable. Activity metrics look like success and hide the liability.

The exception economics never got calculated. The pilot hit a 70% touchless rate, and everyone celebrated, without anyone measuring how long the remaining 30% took to resolve or whether that resolution cost scales. A pilot can be a productivity loss at scale even at a high touchless rate if exceptions are expensive to clear.

The audit trail was an afterthought. The pilot produced decisions but not reconstructable evidence. This passes unnoticed in a pilot and becomes an expensive remediation project the first time an external auditor samples an AI-touched decision in a 2026 audit cycle.

The pilot succeeded in conditions that will not scale. It worked because a senior person hand-held it, or because it ran on the one clean data source, or because the vendor’s implementation team was in the room. None of those conditions survive the second use case.

A 90-day scorecard forces each of these into the open while the decision is still reversible and cheap.

The 90-day scorecard: five dimensions, 100 points

Score each dimension on its stated scale. Total the five. The threshold table follows.

Dimension 1: Outcome Integrity (30 points)

The single most important question: are the AI’s outputs correct, and can you prove it? This dimension is weighted highest because it is the one most often skipped.

Score it by auditing a representative sample of the pilot’s decisions, not by reading the platform’s own success metrics. Pull at least 100 decisions the AI made autonomously and have a qualified human verify them independently.

Award points as follows: 30 points if independent verification finds an accuracy rate of 98% or higher on autonomous decisions with every decision traceable to the rule that produced it. 20 points if accuracy is at least 95% but below 98%, or some decisions cannot be traced to a specific rule. 10 points if accuracy is at least 90% but below 95%, or the verification process itself was difficult because the reasoning was opaque. Zero points if accuracy is below 90%, or if you cannot independently verify the decisions at all because the platform exposes only confidence scores rather than reasoning.
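The award rule can be written down as a small function so that two scorers reach the same number. This is a minimal sketch, not part of the framework itself: the function name and parameters are hypothetical, each band's lower bound is read as inclusive (exactly 98% earns 30), and where an "or" condition applies the lower band wins. Both readings are assumptions.

```python
def outcome_integrity_points(verified_correct: int, sample_size: int,
                             all_traceable: bool, verifiable: bool = True,
                             reasoning_opaque: bool = False) -> int:
    """Map an independently verified sample to the 0/10/20/30 bands.

    Assumption: lower bounds are inclusive, and an "or" condition
    (untraceable decisions, opaque reasoning) pulls the score down a band.
    """
    if sample_size < 100:
        raise ValueError("the framework asks for at least 100 sampled decisions")
    if not verifiable:
        return 0  # only confidence scores exposed, nothing to verify against

    # Integer comparisons avoid floating-point surprises at the band edges.
    def at_least(percent: int) -> bool:
        return verified_correct * 100 >= percent * sample_size

    if not at_least(90):
        return 0
    if not at_least(95) or reasoning_opaque:
        return 10
    if not at_least(98) or not all_traceable:
        return 20
    return 30


# Hypothetical sample: 96 of 100 decisions verified correct, all traceable.
print(outcome_integrity_points(96, 100, all_traceable=True))
```

Writing the rule down before the sample is pulled removes the temptation to round a 94.6% up to a 95%.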

The trap this catches. A platform reporting “94% confident” is not reporting 94% accurate. Confidence is the model’s self-assessment; accuracy is whether it was right. The two are routinely confused, and the gap between them is where pilots quietly fail. See When Confidence Scores Lie.
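To make the distinction concrete, compare the confidence the platform reported with the accuracy a human verified on the same decisions. The sketch below assumes each sampled decision is stored as a (reported confidence, verified correct) pair; that layout, the function name, and the sample numbers are all hypothetical.

```python
from statistics import mean


def confidence_accuracy_gap(decisions):
    """decisions: iterable of (reported_confidence, verified_correct) pairs.

    Returns (average confidence, verified accuracy, gap between them).
    """
    decisions = list(decisions)
    if not decisions:
        raise ValueError("need at least one verified decision")
    avg_confidence = mean(conf for conf, _ in decisions)
    accuracy = mean(1.0 if ok else 0.0 for _, ok in decisions)
    return avg_confidence, accuracy, avg_confidence - accuracy


# Hypothetical sample: every decision reported at 0.94 confidence,
# but independent verification finds only 80 of 100 correct.
sample = [(0.94, True)] * 80 + [(0.94, False)] * 20
avg_conf, acc, gap = confidence_accuracy_gap(sample)
print(f"reported confidence {avg_conf:.0%}, verified accuracy {acc:.0%}, gap {gap:+.0%}")
```

A large positive gap means the platform is more sure of itself than the evidence supports, which is exactly the case the confidence figure alone cannot reveal.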

Dimension 2: Exception Economics (25 points)

A pilot’s touchless rate is meaningless without the cost of the non-touchless remainder. This dimension measures whether exception handling is economically sustainable at production volume.

Measure three things: the touchless rate, the average human time to resolve one exception, and whether that resolution time is falling, flat, or rising as the pilot matures. Then calculate the fully loaded cost per exception and project it to production volume.

Award points as follows: 25 points if the touchless rate is 85% or higher, exceptions resolve in under a minute each with plain-language explanations, and resolution time is falling as the system learns. 17 points if the touchless rate is at least 70% but below 85% and exceptions resolve in 1 to 5 minutes. 8 points if the touchless rate is below 70% or exceptions take more than 5 minutes each. Zero points if exception volume is rising over time, or if reviewers are approving exceptions without genuinely verifying them because the queue is too deep (rubber-stamping, which is both an integrity and an economics failure).

The trap this catches. A pilot at 92% touchless with 10-minute exceptions is economically worse at scale than a pilot at 85% touchless with 30-second exceptions. The touchless rate alone hides this. The math only appears when you cost the exceptions. This is the plateau dynamic covered in Why Most Agentic AP Pilots Stall at 70% Touchless.
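The comparison is easy to check by costing both pilots at the same volume. This is a rough sketch under stated assumptions: the monthly volume and the fully loaded hourly cost are hypothetical placeholders, and the function name is illustrative. Only the two touchless rates and resolution times come from the example above.

```python
def exception_cost(volume: int, touchless_rate: float,
                   minutes_per_exception: float, loaded_hourly_cost: float):
    """Return (reviewer_hours, cost) for the exceptions in `volume` transactions."""
    exceptions = volume * (1.0 - touchless_rate)
    hours = exceptions * minutes_per_exception / 60.0
    return hours, hours * loaded_hourly_cost


VOLUME = 100_000   # hypothetical monthly production volume
HOURLY = 60.0      # hypothetical fully loaded reviewer cost per hour

pilot_a = exception_cost(VOLUME, 0.92, 10.0, HOURLY)  # 92% touchless, 10-minute exceptions
pilot_b = exception_cost(VOLUME, 0.85, 0.5, HOURLY)   # 85% touchless, 30-second exceptions

for name, (hours, cost) in (("A", pilot_a), ("B", pilot_b)):
    print(f"pilot {name}: {hours:,.0f} reviewer hours, cost {cost:,.0f} per month")
```

With these placeholder inputs, pilot A needs roughly ten times the reviewer hours of pilot B despite the higher touchless rate. The ratio depends only on the two rates and resolution times, not on the assumed volume or hourly cost.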

Dimension 3: Audit Defensibility (20 points)

Can the pilot’s decisions survive an external audit? In 2026 this is not optional for any AI touching financial reporting, credit, healthcare, or regulated data.

Test it concretely. Pick one decision the AI made 60 days ago and ask the platform to reconstruct, end to end: the timestamp, the inputs and their sources, the specific rule or policy applied, the reasoning in plain language, the output, the downstream action, and the human reviewer if any. Then show that reconstruction to whoever owns audit relationships and ask whether it would satisfy a walkthrough.
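One way to run this test repeatably is to treat the seven items as a record and list whatever the platform could not supply. The sketch below is illustrative only: the class, field names, and sample values are hypothetical, they mirror the items named in the paragraph above, and they are not the field-level checklist referenced elsewhere in this post.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class DecisionRecord:
    timestamp: Optional[datetime]
    inputs_and_sources: Optional[dict]
    rule_applied: Optional[str]
    reasoning: Optional[str]          # plain-language explanation
    output: Optional[str]
    downstream_action: Optional[str]
    human_reviewer: Optional[str] = None  # may be empty for autonomous decisions


REQUIRED = ("timestamp", "inputs_and_sources", "rule_applied",
            "reasoning", "output", "downstream_action")


def missing_fields(record: DecisionRecord) -> list:
    """Names of required items the platform could not reconstruct."""
    return [name for name in REQUIRED if not getattr(record, name)]


# Hypothetical decision from 60 days ago: outcome logged, reasoning absent.
record = DecisionRecord(
    timestamp=datetime(2026, 3, 2, 14, 5),
    inputs_and_sources={"invoice_total": "ERP export"},
    rule_applied="three-way match within tolerance",
    reasoning=None,
    output="approved",
    downstream_action="payment scheduled",
)
print(missing_fields(record))
```

An empty list is the start of the conversation with your audit owner, not the end of it: the fields can all be present and the reasoning can still need interpretation.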

Award points as follows: 20 points if the platform reconstructs any decision end to end with the specific rule cited in plain language, and your audit owner confirms it would pass. 13 points if reconstruction is possible but requires effort or the reasoning needs interpretation. 6 points if only partial reconstruction is possible. Zero points if the platform logs outcomes and confidence scores but cannot reconstruct the reasoning, which means the audit trail does not exist in any defensible form.

The standards this maps to. COSO’s February 2026 guidance on internal controls over generative AI, PCAOB AS 2201 (effective December 15, 2026) with its expanded benchmarking, and EU AI Act Article 11 (effective August 2, 2026 under current law). The field-level standard is in the AI Audit Trail Requirements checklist.

Dimension 4: Operational Fit (15 points)

Does the pilot fit how the team actually works, or does it require the team to reorganize around the tool? Pilots that demand the second rarely scale, because the reorganization cost multiplies with each new use case.

Assess who can modify the workflow when the process changes (business operators or only developers), whether the team trusts the system enough to act on its outputs, and whether the pilot reduced or merely relocated the work.

Award points as follows: 15 points if business operators can modify the workflow themselves in plain language, the team trusts and uses the outputs, and net work genuinely fell. 10 points if modifications need technical support but turnaround is fast and adoption is solid. 5 points if every change requires developer effort or adoption is reluctant. Zero points if the workflow logic is opaque to the people who own the process, or if the pilot relocated work (from processing to reviewing) without reducing it.

The trap this catches. A pilot that requires a developer for every rule change creates a central bottleneck that becomes the binding constraint at scale. Business-user ownership is not a nice-to-have; it is the difference between a program that compounds and one that queues.

Dimension 5: Scale Readiness (10 points)

Will the conditions that made the pilot succeed survive the second and third use case? This dimension is weighted lowest because it is the most forward-looking, but it is the one that separates a genuine platform from a one-off.

Assess whether the pilot succeeded under realistic conditions or hothouse ones, whether the architecture handles a second workflow without re-implementation, and the realistic time-to-second-workflow.

Award points as follows: 10 points if the pilot ran under production-realistic conditions and the second workflow could launch in a fraction of the first’s time on the same architecture. 6 points if some hand-holding was needed but the path to the second workflow is clear. 3 points if the pilot needed significant vendor support or each new workflow looks like a fresh implementation. Zero points if the pilot only worked under hothouse conditions that will not exist at scale.

The trap this catches. Vendor implementation teams are very good at making the first workflow succeed. The second workflow, built by your team without the vendor in the room, is the real test of whether you bought a platform or a bespoke project.

Scoring thresholds: kill, fix, or scale

Total the five dimensions and read the decision off the table. The threshold is the point of the framework. A score without a pre-committed threshold becomes a number people argue around; a pre-committed threshold makes the decision defensible.

Total score | Decision | What it means
75–100 | Scale | The pilot is correct, economical, defensible, adopted, and extensible. Commit to the next workflows.
50–74 | Fix and recheck | The pilot has a specific, identifiable weakness. Fix that dimension, re-score in 30 to 45 days. Do not scale yet, do not kill.
Below 50 | Stop or rebuild | The pilot has a structural problem that incremental fixes will not solve. Stop, or rebuild against a different architecture or workflow.

Two rules make the thresholds work in practice.

First, commit to the thresholds before you score, ideally before the pilot even begins. A threshold chosen after seeing the score is not a threshold; it is a rationalization.

Second, a zero in any single dimension caps the decision at “fix and recheck” regardless of the arithmetic. A pilot that scores 80 on the strength of four dimensions but zeros Audit Defensibility is not a scale candidate, because the zero is a structural disqualifier, not a deduction. A brilliant, economical, well-adopted pilot whose decisions cannot survive an audit is a brilliant liability.
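Both rules fit in a few lines, which is a good reason to write them down before the pilot starts. The sketch below is one possible encoding; the dictionary keys and the function name are illustrative, while the weights, thresholds, and zero rule come from the framework above.

```python
MAX_POINTS = {
    "outcome_integrity": 30,
    "exception_economics": 25,
    "audit_defensibility": 20,
    "operational_fit": 15,
    "scale_readiness": 10,
}


def decide(scores: dict) -> tuple:
    """Return (total, decision) using the pre-committed thresholds."""
    if set(scores) != set(MAX_POINTS):
        raise ValueError("score every dimension exactly once")
    for name, points in scores.items():
        if not 0 <= points <= MAX_POINTS[name]:
            raise ValueError(f"{name} must be between 0 and {MAX_POINTS[name]}")

    total = sum(scores.values())
    if total < 50:
        return total, "stop or rebuild"
    # A zero anywhere blocks a scale decision, whatever the total.
    if total < 75 or 0 in scores.values():
        return total, "fix and recheck"
    return total, "scale"


# The example from the text: full marks on four dimensions, zero on audit.
print(decide({
    "outcome_integrity": 30,
    "exception_economics": 25,
    "audit_defensibility": 0,
    "operational_fit": 15,
    "scale_readiness": 10,
}))
```

The example totals 80, yet the function returns "fix and recheck" because of the zero.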

The four metrics that actually predict scale

Within the scorecard, four metrics do most of the predictive work. If you track nothing else between checkpoints, track these.

Meaningful-review rate. Of the exceptions a human reviewed, what fraction did the human actually change or catch something on? A high touchless rate with a near-zero meaningful-review rate means the humans are rubber-stamping and the real error rate is unknown. A healthy meaningful-review rate means oversight is genuine.

Exception resolution time trend. Not the static number, the trend. Is it falling as the system learns, or flat, or rising? A falling trend is the signature of a system that turns exceptions into institutional memory. A rising trend is the signature of one that will collapse under volume.

Reconstruction success rate. Of a random sample of past decisions, what fraction can the platform fully reconstruct end to end in plain language? This is the leading indicator of audit defensibility, and it is far more honest than asking the vendor whether they are “audit-ready.”

Time-to-second-workflow. Once the first workflow is live, how long until a second, different workflow goes live, built by your team? This is the single best predictor of whether the pilot is a platform or a project. A second workflow that takes nearly as long as the first means you are re-implementing, not scaling.

Note what is absent from this list: total transactions processed, hours saved in the abstract, and the vendor’s reported confidence scores. Those are the activity metrics that make weak pilots look strong.
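Two of these metrics reduce to simple arithmetic on data the pilot should already be logging. The sketch below is a rough illustration: the function names and sample data are hypothetical, and using a least-squares slope over weekly averages as "the trend" is an assumption, not something the framework prescribes.

```python
def meaningful_review_rate(reviews) -> float:
    """reviews: list of booleans, True when the reviewer changed or caught something."""
    if not reviews:
        raise ValueError("no reviewed exceptions to measure")
    return sum(reviews) / len(reviews)


def resolution_time_trend(minutes_by_week) -> float:
    """Least-squares slope in minutes per week; negative means times are falling."""
    n = len(minutes_by_week)
    if n < 2:
        raise ValueError("need at least two weeks of data")
    x_mean = (n - 1) / 2
    y_mean = sum(minutes_by_week) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(minutes_by_week))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical pilot data.
reviews = [True, False, False, False]   # one of four reviews caught something
weekly_minutes = [6.0, 5.0, 4.0, 3.0]   # average resolution time per week

print(f"meaningful-review rate: {meaningful_review_rate(reviews):.0%}")
print(f"resolution time trend: {resolution_time_trend(weekly_minutes):+.1f} min/week")
```

Reconstruction success rate has the same shape as the first function: fully reconstructed decisions divided by decisions sampled.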

The 30/60/90 checkpoint structure

Do not wait until day 90 to start measuring. Score lightly at 30 and 60 so the day-90 decision is the confirmation of a known trajectory, not a surprise.

At day 30, the question is whether the pilot is instrumented to be measured at all. Are decisions being logged with enough detail to reconstruct them? Is exception time being tracked? Is a sample being independently verified? If the answer at day 30 is “we are not capturing the data we will need to score this,” that is the most valuable possible finding, because there is still time to fix the instrumentation before the evaluation window closes. Most pilots that cannot be scored at day 90 were not instrumented at day 30.

At day 60, run a provisional score on all five dimensions. The point is to surface the weak dimension early. If Exception Economics is trending the wrong way at day 60, there are 30 days to address it before the real decision. A first score at day 90 with no warning leaves no room to fix anything.
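Because the dimensions carry different weights, raw points can hide which one is weakest. A minimal sketch, assuming you compare each provisional score as a share of its maximum (that normalization, the function name, and the sample scores are assumptions for illustration):

```python
MAX_POINTS = {
    "outcome_integrity": 30,
    "exception_economics": 25,
    "audit_defensibility": 20,
    "operational_fit": 15,
    "scale_readiness": 10,
}


def weakest_dimension(provisional: dict) -> str:
    """Return the dimension with the lowest share of its available points."""
    return min(provisional, key=lambda name: provisional[name] / MAX_POINTS[name])


# Hypothetical day-60 provisional scores.
day_60 = {
    "outcome_integrity": 20,
    "exception_economics": 8,
    "audit_defensibility": 13,
    "operational_fit": 10,
    "scale_readiness": 6,
}
print(weakest_dimension(day_60))
```

Here Scale Readiness has the fewest raw points, but Exception Economics has earned the smallest share of what was available, so that is where the next 30 days should go.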

At day 90, run the full score against the pre-committed thresholds and make the call. Because the trajectory was visible at 30 and 60, the day-90 decision should rarely be a shock. The discipline of the earlier checkpoints is what makes the final decision defensible rather than political.

What separates the pilots that scale

Across the agentic AI deployments worth learning from, the pilots that successfully scale share four habits that map directly onto the scorecard.

They instrument for measurement before they start, so the day-90 score is built on real data rather than reconstructed impressions. They score outcome integrity independently, never trusting the platform’s own success metrics as the measure of its success. They cost their exceptions, so a high touchless rate never disguises an uneconomical remainder. And they treat audit defensibility as a day-one design requirement, not a pre-launch scramble, because retrofitting a reconstructable audit trail onto a platform that was not built for one is the most common and most expensive remediation in enterprise AI.

The platforms that score well on this framework tend to share architectural traits: deterministic execution (so the same input reliably produces the same output, which makes outcome integrity verifiable), reasoning expressed in plain language rather than buried in model weights (so audit reconstruction and business-user ownership are both possible), and a single architecture that carries from the first workflow to the next (so time-to-second-workflow is short). Kognitos was built around these traits, which is why deterministic, English-as-code, audit-native platforms tend to score on the scale side of the threshold. But the framework is the point, not the vendor. Score your pilot honestly against these five dimensions whoever built it, and the score will tell you what to do.


Frequently Asked Questions

Score it at the 90-day mark on five weighted dimensions totaling 100 points: Outcome Integrity (30 points, are the outputs correct and verifiable), Exception Economics (25 points, is exception handling sustainable at volume), Audit Defensibility (20 points, can decisions survive an external audit), Operational Fit (15 points, does it fit how the team works), and Scale Readiness (10 points, will it extend to the next workflow). A total of 75 or above supports scaling; 50 to 74 means fix a specific weakness and re-score; below 50 means stop or rebuild. Crucially, evaluate on outcome integrity rather than activity metrics like total transactions processed, and commit to the thresholds before scoring.
A touchless rate above 85% is strong, but the rate alone is misleading without the cost of the remaining exceptions. A pilot at 92% touchless with 10-minute exception resolution can be economically worse at production scale than a pilot at 85% touchless with 30-second resolution. Measure three things together: the touchless rate, the average human time to resolve one exception, and whether that resolution time is falling or rising as the pilot matures. A high and stable touchless rate with fast, falling-cost exceptions is the genuinely healthy signal.
Stop or rebuild a pilot that scores below 50 on the 100-point framework. Separately, a zero on any single dimension blocks a scale decision regardless of the total: the pilot is capped at fix and recheck until that dimension is repaired. A zero in Audit Defensibility, for example, disqualifies a pilot from scaling even if it scores well elsewhere, because decisions that cannot survive an audit are a liability whatever their accuracy. The decision should be made against a threshold committed to before scoring, so that a disappointing score produces a clear action rather than a debate. Killing a pilot early on a defensible score is a success of the process, not a failure of the program.
Four metrics do most of the predictive work: meaningful-review rate (what fraction of human-reviewed exceptions the human actually caught something on, which reveals whether oversight is genuine or rubber-stamping), exception resolution time trend (falling is healthy, rising signals collapse at volume), reconstruction success rate (what fraction of past decisions can be fully rebuilt end to end, the leading indicator of audit defensibility), and time-to-second-workflow (how fast a second workflow goes live built by your own team, the best signal of whether you have a platform or a one-off project). Total transactions processed and vendor-reported confidence scores are not on this list; they are activity metrics that flatter weak pilots.
Ninety days is the standard evaluation window, structured as three checkpoints. At day 30, confirm the pilot is instrumented well enough to be scored at all, since most pilots that cannot be evaluated at day 90 were never set up to capture the right data. At day 60, run a provisional score to surface the weakest dimension while there is still time to address it. At day 90, run the full score against pre-committed thresholds and make the kill, fix, or scale decision. The earlier checkpoints make the final decision a confirmation of a known trajectory rather than a surprise, which is what keeps it defensible rather than political.
Confidence is the model’s self-assessment of how sure it is; accuracy is whether it was actually right. A platform reporting that decisions are “94% confident” is not reporting that they are 94% accurate, and the gap between the two is where pilots quietly fail. To score outcome integrity, pull a representative sample of at least 100 autonomous decisions and have a qualified human verify them independently, rather than trusting the platform’s own confidence figures. If the platform exposes only confidence scores and cannot let you verify actual accuracy against the reasoning, that itself is a scoring failure on both outcome integrity and audit defensibility.
The MIT Project NANDA study (July 2025) found 95% of enterprise generative AI pilots deliver zero measurable P&L impact, but this is largely a measurement and selection failure rather than a technology one. Pilots fail to convert for four recurring reasons the scoring framework catches: they are measured on activity instead of outcome integrity, the exception economics are never calculated so a high touchless rate hides an uneconomical remainder, the audit trail is an afterthought that becomes an expensive remediation later, and the pilot succeeds under hothouse conditions (heavy vendor support, one clean data source) that do not survive the second use case. A 90-day scorecard forces each of these into the open while the decision is still cheap to change.
For a pilot to scale, business operators who own the underlying process should be able to modify the workflow themselves, ideally in plain language, rather than routing every change through developers. A pilot that requires developer effort for each rule change creates a central bottleneck that becomes the binding constraint as you add workflows. This is scored under Operational Fit: full marks require that business operators can modify the workflow themselves, that the team trusts and acts on the outputs, and that the pilot genuinely reduced work rather than relocating it from processing to reviewing. Business-user ownership is the difference between a program that compounds and one that queues behind a development team.

Last updated: June 2026. This article is intended for informational purposes and does not constitute audit, legal, or procurement advice. Scoring weights and thresholds should be adapted to your organization’s risk profile and regulatory environment. Statistics cited include the MIT Project NANDA study (July 2025) and the 2026 regulatory standards from COSO, PCAOB, and the EU AI Act.
