TL;DR
When an AI system records its decision as “Approved. Confidence: 94%.” most teams read it as evidence. Auditors read it as a category error. The score answers “how sure was the model?” The audit question is “which specific rule produced this decision?” Those are not interchangeable.
Three things make this matter more in 2026 than it did in 2024:
- The AI proof gap is now measurable. Grant Thornton’s 2026 AI Impact Survey found that 78% of executives lack strong confidence that they could pass an independent AI governance audit within 90 days. Among organizations with fully integrated AI, 74% are very confident. The gap between intent and proof is the central 2026 governance story.
- Incidents are compounding. The AI Incident Database recorded 362 documented incidents in 2025, up 55% from 2024. The McKinsey and AI Index joint survey found that among organizations reporting AI incidents, the share experiencing 3–5 incidents in a year rose from 30% in 2024 to 50% in 2025. Repeat events, not isolated failures.
- Regulators have stopped accepting “the AI did it.” ECOA requires specific principal reasons for adverse credit decisions. GDPR Article 22 requires meaningful information about the logic of automated decisions. EU AI Act Article 13 requires transparency to deployers, and Article 86 gives affected persons the right to an explanation. COSO’s February 2026 guidance and PCAOB AS 2201’s December 2026 effective date require reconstructable reasoning for SOX-relevant controls. None of these is satisfied by a number between zero and one.
The architectural fix is to express the AI’s reasoning in the same language an auditor reads, log the specific rule cited at the moment of the decision, and treat confidence (where present) as supporting evidence rather than as the explanation itself.
This post explains why the distinction matters, what the regulators actually require, and how to tell, during a vendor evaluation, whether the AI you’re about to buy is producing audit trails or just producing numbers. For the parallel field-by-field requirement view, see AI audit trail requirements: a 2026 checklist for finance, healthcare, and banking.
The category error at the heart of confidence scores
Confidence scores and audit trails are different artifacts answering different questions. Treating them as interchangeable is the central mistake in most AI governance programs in 2026.
A confidence score tells you how sure the model is. It’s a number between zero and one, derived from the probability distribution the model assigned to its output. It’s a statement about the model’s internal state at the moment of inference. “0.94” means the model’s softmax (or equivalent) placed 94% of the probability mass on the chosen output. That is genuinely useful information for the team running the model.
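To make the first artifact concrete, here is a minimal Python sketch of where a number like 0.94 comes from. The logits and labels are hypothetical, and a production model would expose its probabilities through its own interface rather than through this helper.

```python
import math


def softmax(logits):
    """Convert raw model scores into a probability distribution."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(logits, labels):
    """Return the chosen label and the probability mass placed on it."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical logits for a two-way approve/reject decision.
label, score = confidence([2.75, 0.0], ["approved", "rejected"])
print(label, round(score, 2))
```

Nothing in this calculation refers to a policy, a tolerance, or a purchase order. The score describes the distribution, not the reason.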
An audit trail tells you why the decision was right. It’s a reconstructable chain from the input through the specific policy or rule applied to the output. It’s a statement about the world, not the model. The audit trail says “this invoice matched this PO because the vendor, total, and PO number aligned within the 2% tolerance defined in policy X.” The number 94 has nothing to do with the question.
When the auditor asks “why was this decision made,” they are asking for the second artifact. The first artifact answers a different question: “how sure was the model of its output?” The two are related, but they are not the same. Substituting one for the other is the category error.
Most enterprise AI systems in 2026 produce only the first artifact. The second is harder to engineer, requires a different architecture, and is what separates audit-ready AI from probabilistic AI that happens to keep a log. For why the architecture matters more than the layered governance veneer, see AI governance is not a checklist—it’s an architectural choice.
Why the substitution feels reasonable (and isn’t)
The reason teams accept confidence scores as audit evidence is that the score feels like an explanation. The number is specific. It’s quantitative. It looks like a measurement. It comes from the system itself, not from a separate documentation process. For most engineering and product teams, that’s the language of proof.
The problem is that the number is measuring the wrong thing.
A 94% confidence score on an invoice match doesn’t tell you that the match was correct. It tells you that the model placed 94% of its probability mass on that output. Those are different claims. The first is a claim about reality. The second is a claim about the model’s internal state. You can have a model that produces 94% confidence on a decision that’s flatly wrong, and a model that produces 60% confidence on a decision that’s exactly right. The confidence score doesn’t distinguish between them.
The auditor’s job is to verify claims about reality. The confidence score is evidence about the model. Even if the score is accurate (and calibration is its own problem), it doesn’t bear on the question the auditor is asking.
This isn’t an abstract concern. 2026 Aveni research found that hallucination rates across LLMs ranged from 22% at the best-performing end to 94% at the worst, depending on the domain and question type. A model can report 94% confidence in a domain where its hallucination rate is 94%. In that situation the number tells you nothing about whether the output is correct. For seven concrete places this fails inside accounts payable today, see the 7 places generative AI quietly fails in accounts payable.
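A small sketch can make the gap between confidence and correctness visible. It compares the average reported confidence with the observed accuracy on a set of decisions that someone has independently verified. The sample data is invented for illustration and is not drawn from the Aveni research.

```python
def calibration_gap(records):
    """records: list of (confidence, was_correct) pairs.

    Returns (mean confidence, observed accuracy, gap between them).
    """
    if not records:
        raise ValueError("need at least one verified decision")
    mean_conf = sum(conf for conf, _ in records) / len(records)
    accuracy = sum(1 for _, ok in records if ok) / len(records)
    return mean_conf, accuracy, mean_conf - accuracy


# Hypothetical sample: every decision logged at 0.94, half of them wrong.
sample = [(0.94, True), (0.94, False), (0.94, False), (0.94, True)]
mean_conf, accuracy, gap = calibration_gap(sample)
print(f"confidence={mean_conf:.2f} accuracy={accuracy:.2f} gap={gap:.2f}")
```

A large gap shows that the score is poorly calibrated. A gap of zero would still not say which rule produced any individual decision.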
What the regulators actually require
Five regulatory and standards frameworks in 2026 explicitly require something confidence scores cannot provide. The pattern is consistent enough that “we logged the confidence” is not a defense under any of them.
ECOA: “Specific principal reasons”
The Equal Credit Opportunity Act requires creditors to provide specific principal reasons for adverse credit decisions. CFPB Circular 2022-03 made this explicit for “complex algorithms,” and Circular 2023-03 reinforced it: the use of AI does not relieve the creditor of the obligation to disclose specific reasons. A 94% confidence score is not a specific principal reason. “Your debt-to-income ratio exceeded our threshold of X” is. The AI must produce the second artifact, not just the first.
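As an illustration of the second artifact, the sketch below derives reason statements directly from the thresholds that triggered them. The rule codes, field names, and threshold values are hypothetical, and the output is not a substitute for a compliant adverse action notice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    code: str
    field: str
    limit: float
    reason: str  # template filled with the applicant's value and the limit


# Hypothetical underwriting thresholds, for illustration only.
RULES = [
    Rule("DTI", "debt_to_income", 0.43,
         "Debt-to-income ratio of {value:.0%} exceeded our threshold of {limit:.0%}"),
    Rule("UTIL", "credit_utilization", 0.80,
         "Credit utilization of {value:.0%} exceeded our threshold of {limit:.0%}"),
]


def adverse_action_reasons(applicant):
    """Return one stated reason per rule the applicant failed."""
    return [
        rule.reason.format(value=applicant[rule.field], limit=rule.limit)
        for rule in RULES
        if applicant[rule.field] > rule.limit
    ]


reasons = adverse_action_reasons(
    {"debt_to_income": 0.51, "credit_utilization": 0.35}
)
print(reasons)
```

Each reason names the rule and the value that tripped it, which is the kind of statement a confidence score cannot supply.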
GDPR Article 22: meaningful information about the logic involved
For automated decisions with legal or similarly significant effects on individuals in the EU, the data subject has the right to meaningful information about the logic involved. The phrase itself appears in Articles 13 to 15, which attach that right to decisions covered by Article 22. The European Data Protection Board has consistently interpreted “meaningful” as requiring an explanation a reasonable person can understand, not a probability. Confidence scores fail this test on their face.
EU AI Act Articles 13 and 86
Article 13 requires providers of high-risk AI systems to provide transparency to deployers, including documentation of the system’s intended purpose, capabilities, and limitations. Article 86 (the right to explanation) gives affected persons the right to obtain clear and meaningful explanations of the role of the AI system in the decision-making procedure. Full enforcement of high-risk provisions begins August 2, 2026 under current law. Confidence scores are not explanations under either article.
COSO’s February 2026 guidance on generative AI
COSO’s “Achieving Effective Internal Control Over Generative AI,” published February 23, 2026, requires that effective monitoring capture “prompts, inputs, outputs, model and configuration versions, and evidence of human review, sufficient to reconstruct what the AI acted on and show that the control functioned as designed.” Confidence scores are part of the output. They are not the explanation that demonstrates the control functioned as designed.
PCAOB AS 2201’s expanded benchmarking
For audits of fiscal years beginning on or after December 15, 2026, AS 2201’s expanded benchmarking allows auditors to conclude that a fully automated application control remains effective without repeating operating effectiveness testing if (a) ITGCs are effective and (b) the decision logic has not changed since prior-year testing. Confidence scores cannot demonstrate that “the decision logic has not changed.” They demonstrate that “the model’s outputs in this sample fell within an expected confidence band.” Those are different conclusions.
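One way to produce evidence that decision logic has not changed, assuming the logic exists as inspectable rule text, is to fingerprint that text at the time of testing and compare it later. This is a sketch of the idea only. AS 2201 does not prescribe hashing, and the function names here are hypothetical.

```python
import hashlib


def logic_fingerprint(rule_text):
    """Hash the rule text after collapsing whitespace, so reformatting
    alone does not register as a logic change."""
    normalized = " ".join(rule_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def logic_unchanged(tested_fingerprint, current_rule_text):
    """True when the current rule text matches what was tested last year."""
    return logic_fingerprint(current_rule_text) == tested_fingerprint


# Fingerprint recorded during the prior-year test (hypothetical rule text).
prior = logic_fingerprint("The total agrees within the 2% tolerance.")

print(logic_unchanged(prior, "The total agrees  within the 2% tolerance."))
print(logic_unchanged(prior, "The total agrees within the 3% tolerance."))
```

A confidence band has no equivalent of this comparison, because there is no rule text to compare.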
The pattern across all five: regulators are asking for the rule, not the certainty. Confidence scores answer certainty. The rule has to come from somewhere else.
The ISACA framework: what an AI audit trail must show
ISACA’s May 2026 article “The AI Audit Trail: From AI Policy to AI Proof” articulates the cleanest framework available. The article argues that an AI audit trail must show four things:
- Identity. Who, or what, initiated the request.
- Data Lineage. What data was retrieved, referenced, filtered, or denied, and whether that use was authorized for that user, task, or context.
- Control State. What policies, safeguards, and access controls were in force at the time of the decision.
- Temporal Integrity. The specific model, configuration, and data snapshot active when the answer was produced.
Note what’s not in the list. Confidence scores aren’t one of the four. The reason is that confidence is a property of the model’s output, not a property of the audit chain. The four items above are about the path the decision traveled, who controlled that path, and what state the system was in. The confidence is downstream of all four. It can be logged as supplementary information, but it cannot substitute for any of them.
This is also why ISACA’s article opens with the line: “A policy cannot prove that an AI system behaved correctly at the moment it mattered. It cannot prove what the system touched, what controls were applied, or whether the answer should have been produced in the first place. For that, we need runtime evidence.”
Runtime evidence is the audit trail. The confidence score is one signal within it. They are not interchangeable.
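A log record built around the four items might look like the following sketch. The field names and types are assumptions chosen for illustration and are not taken from the ISACA article. Confidence is optional and plays no part in the completeness check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditRecord:
    decision: str
    identity: str = ""            # who, or what, initiated the request
    data_lineage: tuple = ()      # inputs retrieved, with their sources
    control_state: str = ""       # policy or rule in force at decision time
    temporal_integrity: str = ""  # model, configuration, and data snapshot
    confidence: Optional[float] = None  # supplementary signal only


REQUIRED = ("identity", "data_lineage", "control_state", "temporal_integrity")


def missing_fields(record):
    """List which of the four required items are empty on this record."""
    return [name for name in REQUIRED if not getattr(record, name)]


score_only = AuditRecord(decision="Approved", confidence=0.94)
print(missing_fields(score_only))
```

A record that carries only a decision and a score fails all four checks, however high the score is.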
What auditors actually do with confidence scores
In 2026 audit cycles across SOX, HIPAA, FFIEC, and EU AI Act-aligned reviews, auditors are increasingly treating confidence scores in one of three ways:
1. Useful supporting evidence, when paired with the rule.
“This invoice was approved per the 3-way match policy (PO 4521 matched on vendor, total within 2% tolerance, GR confirmed); the AI’s confidence in the input data quality was 0.97” is acceptable. The confidence supports the explanation. The explanation doesn’t depend on the confidence.
2. Acceptable as one input to drift monitoring, never as the explanation itself.
A platform whose confidence scores trend downward over time is signaling possible data drift, model drift, or process change. That’s a valid use of the metric. It is not a substitute for explaining individual decisions.
3. Treated as a deficiency when offered as the explanation.
An audit trail that reads “Decision: Approved. Confidence: 0.94” with no associated rule is increasingly cited as a control design deficiency under PCAOB AS 2201 and a documentation gap under COSO’s February 2026 guidance. The most experienced audit teams treat this as a material weakness for in-scope ICFR controls.
The third treatment is new in 2026. Two years ago, a confidence-score-only audit trail might have been characterized as “documentation that should be improved.” In 2026, the regulators have closed the gap between “should be improved” and “is a finding.” For the auditor-question framing of the same shift, see what your SOX auditor will ask about your AI automation.
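The second treatment above, confidence as a drift signal, can be sketched as a comparison between an early baseline window and the most recent window of scores. The window size and alert threshold are hypothetical defaults, and a real monitoring setup would likely use a more robust statistical test.

```python
from statistics import mean


def confidence_drift(scores, window=50, max_drop=0.05):
    """Compare mean confidence in the first and last `window` decisions.

    Returns None when there is not enough history to compare.
    """
    if len(scores) < 2 * window:
        return None
    baseline = mean(scores[:window])
    recent = mean(scores[-window:])
    drop = baseline - recent
    return {
        "baseline": baseline,
        "recent": recent,
        "drop": drop,
        "alert": drop > max_drop,
    }


# Hypothetical history: confidence falls from 0.95 to 0.85.
report = confidence_drift([0.95] * 50 + [0.85] * 50)
print(report["alert"])
```

The alert says that something about the inputs or the model may have shifted. It says nothing about why any single invoice was approved.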
The architectural distinction: probabilistic vs deterministic reasoning
The reason confidence scores ever became a default audit artifact is that probabilistic AI systems don’t produce explanations natively. Their reasoning is an emergent property of model weights. To produce an “explanation,” the system has to either (a) post-rationalize the output by generating text that describes the decision after the fact, which is itself a probabilistic process and can hallucinate, or (b) report the confidence as a proxy for the missing explanation.
Neither approach satisfies the regulators. Post-rationalization can be wrong. Confidence as proxy answers the wrong question.
The architectural alternative is to ground the AI’s reasoning in explicit, inspectable rules that exist before the decision is made. The rule is the explanation. The decision is the execution of the rule against the input. The audit trail is the link between them. Confidence, if logged at all, is supporting information about input data quality or model agreement, not the explanation itself.
This is the difference between probabilistic AI as a tool and neurosymbolic, deterministic AI as a control. The first is what most enterprise AI looked like in 2024. The second is what the audit-ready architecture looks like in 2026.
In a deterministic, English-as-code system, the audit trail for an invoice approval looks like this:
“Approved per the 3-way match policy that states ‘an invoice matches a PO when the vendor name resolves to the same ERP record, the total agrees within the 2% tolerance, and the goods receipt is confirmed within the same fiscal period.’ Vendor: Acme Corp (ERP ID 4521). Total: $4,892 (PO: $4,850, variance: 0.87%). GR: confirmed 2026-04-15. Result: matched.”
That reads back in plain English. It cites the specific policy. It identifies the specific values. It links the decision to a reproducible chain. No confidence score appears, because none is needed.
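For readers who think in code, the same chain can be sketched in Python. This is an illustration of the pattern, not Kognitos syntax. The policy ID, function name, and fiscal period dates are hypothetical, and the sketch assumes a nonzero PO total.

```python
from datetime import date
from decimal import Decimal

POLICY_ID = "AP-3WM-001"  # hypothetical label for the policy quoted above
POLICY_TEXT = (
    "an invoice matches a PO when the vendor name resolves to the same ERP "
    "record, the total agrees within the 2% tolerance, and the goods receipt "
    "is confirmed within the same fiscal period"
)
TOLERANCE = Decimal("0.02")


def three_way_match(invoice_vendor, po_vendor, invoice_total, po_total,
                    gr_date, period_start, period_end):
    """Apply the policy and return the outcome with the values behind it."""
    variance = abs(invoice_total - po_total) / po_total
    checks = {
        "vendor": invoice_vendor == po_vendor,
        "total": variance <= TOLERANCE,
        "goods_receipt": gr_date is not None
        and period_start <= gr_date <= period_end,
    }
    matched = all(checks.values())
    variance_pct = (variance * 100).quantize(Decimal("0.01"))
    explanation = (
        f"{'Approved' if matched else 'Not approved'} per policy {POLICY_ID}: "
        f"'{POLICY_TEXT}'. Vendor ERP ID: {invoice_vendor} (PO: {po_vendor}). "
        f"Total: ${invoice_total} (PO: ${po_total}, variance: {variance_pct}%). "
        f"GR: {gr_date.isoformat() if gr_date else 'not confirmed'}. "
        f"Result: {'matched' if matched else 'not matched'}."
    )
    return {"matched": matched, "variance_pct": variance_pct,
            "checks": checks, "explanation": explanation}


result = three_way_match(
    "4521", "4521", Decimal("4892"), Decimal("4850"),
    date(2026, 4, 15), date(2026, 4, 1), date(2026, 6, 30),
)
print(result["explanation"])
```

The same inputs always yield the same result and the same explanation, and the explanation is assembled from the rule and the values rather than generated after the fact.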
How to evaluate this during a vendor pilot
If you are evaluating AI platforms in 2026 and want to know whether you’re buying audit-ready AI or confidence-score-only AI, run these four tests during the pilot. For a market-level view of the platforms that meet this bar today, see top AI platforms for automated reconciliation.
Test 1: Pick a decision, ask for the rule.
Select any decision the platform made during your pilot. Ask the vendor to produce, in plain language, the specific rule or policy that produced it. If the answer includes a confidence score but not a stated rule, you have your answer.
Test 2: Change an input, ask what changes.
Take an input the platform processed correctly. Change one element. Run it again. Ask the vendor’s tool to explain what changed. If the explanation is “confidence dropped from 0.94 to 0.78,” that’s not an explanation of what changed. It’s a measurement of how the model responded to what changed.
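For contrast, a rule-level answer to “what changed” might look like the sketch below. It assumes each run records the outcome of every rule condition as a pass/fail flag plus a detail string. The structure and names are hypothetical.

```python
def describe(outcome):
    """Render one condition outcome as readable text."""
    if outcome is None:
        return "not evaluated"
    passed, detail = outcome
    return f"{'pass' if passed else 'fail'} ({detail})"


def explain_change(before, after):
    """Compare two runs and list the conditions whose outcome differs.

    before/after map a condition name to a (passed, detail) tuple.
    """
    lines = []
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            lines.append(f"{name}: {describe(old)} -> {describe(new)}")
    return lines


# Hypothetical runs: the invoice total was raised between them.
run_1 = {"vendor": (True, "ERP ID 4521"), "total": (True, "variance 0.87%")}
run_2 = {"vendor": (True, "ERP ID 4521"), "total": (False, "variance 3.09%")}
print(explain_change(run_1, run_2))
```

The output names the condition that flipped and the value that flipped it, which is the answer the test is looking for.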
Test 3: Ask how the rule is version-controlled.
Confidence scores don’t have versions. Rules do. If the platform cannot show you when a specific rule changed, who changed it, what the diff was, and what the prior version said, the platform is treating its decision logic as opaque model state. Under PCAOB AS 2201, opaque model state re-opens every operating effectiveness conclusion, because the auditor cannot establish that the decision logic is unchanged since it was last tested. The same discipline shows up in the AI Bill of Materials (AIBOM) your procurement team will start asking for.
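The four things to ask for (when, who, the diff, and the prior version) can be sketched as an append-only rule history. The class and field names are hypothetical. In practice a platform might back this with a version control system rather than an in-memory list.

```python
import difflib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RuleVersion:
    number: int
    text: str
    changed_by: str
    changed_at: datetime


class RuleHistory:
    """Append-only history of one rule's text."""

    def __init__(self):
        self._versions = []

    def commit(self, text, changed_by):
        version = RuleVersion(len(self._versions) + 1, text, changed_by,
                              datetime.now(timezone.utc))
        self._versions.append(version)
        return version

    def get(self, number):
        if not 1 <= number <= len(self._versions):
            raise KeyError(f"no version {number}")
        return self._versions[number - 1]

    def diff(self, old, new):
        """Unified diff between two versions of the rule text."""
        return list(difflib.unified_diff(
            self.get(old).text.splitlines(),
            self.get(new).text.splitlines(),
            fromfile=f"v{old}", tofile=f"v{new}", lineterm="",
        ))


history = RuleHistory()
history.commit("The total agrees within the 2% tolerance.", "alice")
history.commit("The total agrees within the 3% tolerance.", "bob")
print("\n".join(history.diff(1, 2)))
```

If a vendor can produce the equivalent of this diff for any rule on request, the decision logic is being treated as a versioned artifact rather than as model state.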
Test 4: Show the audit trail to an auditor before you sign the contract.
If your relationship with your external auditor allows it, walk through a sample audit trail from the vendor’s platform during evaluation. Ask the auditor whether the sample would satisfy a walkthrough under your control environment. Audit firms in 2026 are increasingly willing to do this informally, because they would rather flag the gap during evaluation than during the integrated audit.
How Kognitos handles this
Kognitos is a deterministic, neurosymbolic AI platform built specifically around the question this post describes. Every Kognitos automation:
- Is written in plain English. The matching policy, the exception logic, the approval criteria, and the posting rule are all expressed in English-as-code. The English the auditor reads in the walkthrough is the same English that runs in production.
- Executes deterministically. Same input produces the same output every time. The specific rule that drove the decision is cited in the audit log, not just the match outcome.
- Logs the four ISACA fields by default. Identity (authenticated user plus AI system identity), Data Lineage (every input with source attribution), Control State (the specific rule in force at the moment of the decision), Temporal Integrity (model and configuration version pinned and logged).
- Treats confidence as supporting evidence, never as explanation. Where probabilistic inputs are involved (such as document extraction from unstructured invoices), the confidence on the input quality is logged alongside the deterministic rule that processed it. The rule is the explanation. The confidence is metadata.
- Maps to SOX, COSO, PCAOB AS 2201, ECOA, GDPR Article 22, and EU AI Act Articles 13 and 86 by design. The architecture was built for the question regulators are now asking, not retrofitted to it.
Kognitos is SOC 2 Type II, HIPAA, GDPR, and ISO 27001 aligned, with ISO/IEC 42001 alignment work underway (see our Trust portal).
If you are preparing for a 2026 audit cycle and want to see what English-as-code audit trails look like in production, we’d be glad to walk through a working example on your highest-risk AI-touched process.
Book a working session with a Kognitos solutions engineer → or try Kognitos free →
Last updated: May 2026. This article is intended for informational purposes and does not constitute legal, audit, or compliance advice. Regulatory requirements continue to evolve. Engage qualified counsel for guidance specific to your control environment and jurisdiction.
