
When Confidence Scores Lie: Why “94% Confident” Is Not an Audit Trail


A confidence score tells you how sure the AI is. An audit trail tells you why the AI was right. Those are not the same thing, and in 2026 your auditor knows the difference.

Kognitos, May 20, 2026


TL;DR

When an AI system records its decision as “Approved. Confidence: 94%,” most teams read it as evidence. Auditors read it as a category error. The score answers “how sure was the model?” The audit question is “which specific rule produced this decision?” Those are not interchangeable.

Three things make this matter more in 2026 than it did in 2024:

The AI proof gap is now measurable. Grant Thornton’s 2026 AI Impact Survey found that 78% of executives lack strong confidence that they could pass an independent AI governance audit within 90 days. Among organizations with fully integrated AI, 74% are very confident. The gap between intent and proof is the central 2026 governance story.
Incidents are compounding. The AI Incident Database recorded 362 documented incidents in 2025, up 55% from 2024. The McKinsey and AI Index joint survey found that among organizations reporting AI incidents, the share experiencing 3–5 incidents in a year rose from 30% in 2024 to 50% in 2025. Repeat events, not isolated failures.
Regulators have stopped accepting “the AI did it.” ECOA requires specific principal reasons for adverse credit decisions. GDPR Article 22 requires meaningful information about the logic of automated decisions. EU AI Act Article 13 requires transparency to deployers, and Article 86 gives affected persons the right to an explanation. COSO’s February 2026 guidance and PCAOB AS 2201’s expanded benchmarking, effective for fiscal years beginning on or after December 15, 2026, both depend on reconstructable reasoning for SOX-relevant controls. None of these is satisfied by a number between zero and one.

The architectural fix is to express the AI’s reasoning in the same language an auditor reads, log the specific rule cited at the moment of the decision, and treat confidence (where present) as supporting evidence rather than as the explanation itself.

This post explains why the distinction matters, what the regulators actually require, and how to tell, during a vendor evaluation, whether the AI you’re about to buy is producing audit trails or just producing numbers. For the parallel field-by-field requirement view, see AI audit trail requirements: a 2026 checklist for finance, healthcare, and banking.

The category error at the heart of confidence scores

Confidence scores and audit trails are different artifacts answering different questions. Treating them as interchangeable is the central mistake in most AI governance programs in 2026.

A confidence score tells you how sure the model is. It’s a number between zero and one, derived from the probability distribution the model assigned to its output. It’s a statement about the model’s internal state at the moment of inference. “0.94” means the model’s softmax (or equivalent) placed 94% of the probability mass on the chosen output. That is genuinely useful information for the team running the model.
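To make that definition concrete, here is a minimal sketch of how such a score is typically derived for a classifier. The labels and raw scores are made up for illustration, and real systems may compute confidence differently.

```python
import math

def softmax(logits):
    """Convert raw model scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three candidate outputs.
labels = ["approve", "reject", "escalate"]
logits = [4.0, 1.0, 0.5]

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
decision, confidence = labels[best], probs[best]

# The score describes the model's probability distribution only.
# Nothing here records which policy or rule justified the decision.
print(decision, round(confidence, 2))
```

Note that every value in the sketch comes from the model. No policy, threshold, or rule appears anywhere in the computation.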

An audit trail tells you why the decision was right. It’s a reconstructable chain from the input through the specific policy or rule applied to the output. It’s a statement about the world, not the model. The audit trail says “this invoice matched this PO because the vendor, total, and PO number aligned within the 2% tolerance defined in policy X.” The number 94 has nothing to do with the question.

When the auditor asks “why was this decision made,” they are asking for the second artifact. The first artifact answers “how sure was the model that the second artifact applied.” These are related but they are not the same. Substituting one for the other is the category error.

Most enterprise AI systems in 2026 produce only the first artifact. The second is harder to engineer, requires a different architecture, and is what separates audit-ready AI from probabilistic AI that happens to keep a log. For why the architecture matters more than the layered governance veneer, see AI governance is not a checklist, it’s an architectural choice.

Why the substitution feels reasonable (and isn’t)

The reason teams accept confidence scores as audit evidence is that the score feels like an explanation. The number is specific. It’s quantitative. It looks like a measurement. It comes from the system itself, not from a separate documentation process. For most engineering and product teams, that’s the language of proof.

The problem is that the number is measuring the wrong thing.

A 94% confidence score on an invoice match doesn’t tell you that the match was correct. It tells you the model assigned 94% of its probability mass to that output, given this input. Those are different claims. The first is a claim about reality. The second is a claim about the model’s behavior. You can have a model that produces 94% confidence on a decision that’s flatly wrong, and a model that produces 60% confidence on a decision that’s exactly right. The confidence score doesn’t distinguish between them.

The auditor’s job is to verify claims about reality. The confidence score is evidence about the model. Even if the score is accurate (and calibration is its own problem), it doesn’t bear on the question the auditor is asking.

This isn’t an abstract concern. Aveni research published in 2026 found that hallucination rates across LLMs ranged from 22% at the best-performing end to 94% at the worst, depending on the domain and question type. A model can report 94% confidence in a domain where its hallucination rate is 94%. The number tells you nothing about whether the output is correct. For seven concrete places this fails inside accounts payable today, see the 7 places generative AI quietly fails in accounts payable.

What the regulators actually require

Five regulatory and standards frameworks in 2026 explicitly require something confidence scores cannot provide. The pattern is consistent enough that “we logged the confidence” is not a defense under any of them.

ECOA: “Specific principal reasons”

The Equal Credit Opportunity Act requires creditors to provide specific principal reasons for adverse credit decisions. CFPB Circular 2023-03 made this explicit for “complex algorithms”: the use of AI does not relieve the creditor of the obligation to disclose specific reasons. A 94% confidence score is not a specific principal reason. “Your debt-to-income ratio exceeded our threshold of X” is. The AI must produce the second artifact, not just the first.
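As a rough illustration of the second artifact, the sketch below derives each stated reason directly from the explicit threshold that triggered it. The thresholds, field names, and wording are hypothetical and are not taken from ECOA or the CFPB circular.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Applicant:
    debt_to_income: float   # e.g. 0.47 means 47%
    months_of_history: int

# Hypothetical policy thresholds, for illustration only.
MAX_DTI = 0.43
MIN_HISTORY_MONTHS = 12

def decide(applicant):
    """Return (decision, reasons); every decline names the rule that triggered it."""
    reasons = []
    if applicant.debt_to_income > MAX_DTI:
        reasons.append(
            f"Debt-to-income ratio {applicant.debt_to_income:.0%} "
            f"exceeded our threshold of {MAX_DTI:.0%}"
        )
    if applicant.months_of_history < MIN_HISTORY_MONTHS:
        reasons.append(
            f"Credit history of {applicant.months_of_history} months "
            f"is shorter than the required {MIN_HISTORY_MONTHS} months"
        )
    return ("declined" if reasons else "approved", reasons)

decision, reasons = decide(Applicant(debt_to_income=0.47, months_of_history=30))
print(decision, reasons)
```

The reasons come out of the same comparisons that made the decision, so there is nothing to reconstruct after the fact.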

GDPR Article 22: “Meaningful information about the logic involved”

Under GDPR Article 22, for automated decisions with significant effects on EU persons, the data subject has the right to meaningful information about the logic involved. The European Data Protection Board has consistently interpreted “meaningful” as requiring an explanation a reasonable person can understand, not a probability. Confidence scores fail this test on their face.

EU AI Act Articles 13 and 86

Article 13 requires providers of high-risk AI systems to provide transparency to deployers, including documentation of the system’s intended purpose, capabilities, and limitations. Article 86 (the right to explanation) gives affected persons the right to obtain clear and meaningful explanations of the role of the AI system in the decision-making procedure. Full enforcement of high-risk provisions begins August 2, 2026 under current law. Confidence scores are not explanations under either article.

COSO’s February 2026 guidance on generative AI

COSO’s “Achieving Effective Internal Control Over Generative AI,” published February 23, 2026, requires that effective monitoring capture “prompts, inputs, outputs, model and configuration versions, and evidence of human review, sufficient to reconstruct what the AI acted on and show that the control functioned as designed.” Confidence scores are part of the output. They are not the explanation that demonstrates the control functioned as designed.

PCAOB AS 2201’s expanded benchmarking

For audits of fiscal years beginning on or after December 15, 2026, AS 2201’s expanded benchmarking allows auditors to conclude that a fully automated application control remains effective without repeating operating effectiveness testing if (a) ITGCs are effective and (b) the decision logic has not changed since prior-year testing. Confidence scores cannot demonstrate that “the decision logic has not changed.” They demonstrate that “the model’s outputs in this sample fell within an expected confidence band.” Those are different conclusions.

The pattern across all five: regulators are asking for the rule, not the certainty. Confidence scores answer certainty. The rule has to come from somewhere else.

The ISACA framework: what an AI audit trail must show

ISACA’s May 2026 article “The AI Audit Trail: From AI Policy to AI Proof” articulates the cleanest framework available. The article argues that an AI audit trail must show four things:

Identity. Who, or what, initiated the request.
Data Lineage. What data was retrieved, referenced, filtered, or denied, and whether that use was authorized for that user, task, or context.
Control State. What policies, safeguards, and access controls were in force at the time of the decision.
Temporal Integrity. The specific model, configuration, and data snapshot active when the answer was produced.

Note what’s not in the list. Confidence scores aren’t one of the four. The reason is that confidence is a property of the model’s output, not a property of the audit chain. The four items above are about the path the decision traveled, who controlled that path, and what state the system was in. The confidence is downstream of all four. It can be logged as supplementary information, but it cannot substitute for any of them.
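One way to picture the four elements is as required fields on a log record, with confidence as an optional extra. The sketch below is an assumption about how such a record could be shaped. The field names are illustrative and are not ISACA's schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class AuditRecord:
    # Identity: who, or what, initiated the request.
    initiated_by: str
    # Data Lineage: what data was retrieved, referenced, filtered, or denied.
    data_sources: tuple
    # Control State: policies and safeguards in force at decision time.
    policies_in_force: tuple
    # Temporal Integrity: model, configuration, and data snapshot in use.
    model_version: str
    config_version: str
    data_snapshot: str
    # The decision and the specific rule cited for it.
    decision: str
    rule_cited: str
    # Supplementary only; the record is complete without it.
    confidence: Optional[float] = None

REQUIRED = (
    "initiated_by", "data_sources", "policies_in_force",
    "model_version", "config_version", "data_snapshot", "rule_cited",
)

def missing_fields(record):
    """Return the names of required fields that are empty."""
    values = asdict(record)
    return [name for name in REQUIRED if not values[name]]

score_only = AuditRecord(
    initiated_by="", data_sources=(), policies_in_force=(),
    model_version="", config_version="", data_snapshot="",
    decision="approved", rule_cited="", confidence=0.94,
)
print(missing_fields(score_only))
```

A record that carries only a decision and a confidence value fails every required check, which is the point of the framework.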

This is also why ISACA’s article opens with the line: “A policy cannot prove that an AI system behaved correctly at the moment it mattered. It cannot prove what the system touched, what controls were applied, or whether the answer should have been produced in the first place. For that, we need runtime evidence.”

Runtime evidence is the audit trail. The confidence score is one signal within it. They are not interchangeable.

What auditors actually do with confidence scores

In 2026 audit cycles across SOX, HIPAA, FFIEC, and EU AI Act-aligned reviews, auditors are increasingly treating confidence scores in one of three ways:

1. Useful supporting evidence, when paired with the rule.

“This invoice was approved per the 3-way match policy (PO 4521 matched on vendor, total within 2% tolerance, GR confirmed); the AI’s confidence in the input data quality was 0.97” is acceptable. The confidence supports the explanation. The explanation doesn’t depend on the confidence.

2. Acceptable as one input to drift monitoring, never as the explanation itself.

A platform whose confidence scores trend downward over time is signaling possible data drift, model drift, or process change. That’s a valid use of the metric. It is not a substitute for explaining individual decisions.
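A minimal sketch of that drift check, assuming confidence scores are available as a time-ordered list. The window sizes and the alert threshold are arbitrary placeholders, not recommended values.

```python
from statistics import mean

def confidence_drift(scores, baseline_size=50, window=50, max_drop=0.05):
    """Compare recent average confidence against an early baseline.

    Returns None when there is not enough history to compare.
    """
    if len(scores) < baseline_size + window:
        return None
    baseline = mean(scores[:baseline_size])
    recent = mean(scores[-window:])
    drop = baseline - recent
    return {
        "baseline": baseline,
        "recent": recent,
        "drop": drop,
        "alert": drop > max_drop,
    }

# Hypothetical history: confidence falls from 0.95 to 0.85.
history = [0.95] * 50 + [0.85] * 50
report = confidence_drift(history)
print(report["alert"])
```

An alert here says something about the population of decisions. It says nothing about why any single decision was made.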

3. Treated as a deficiency when offered as the explanation.

An audit trail that reads “Decision: Approved. Confidence: 0.94” with no associated rule is increasingly cited as a control design deficiency under PCAOB AS 2201 and a documentation gap under COSO’s February 2026 guidance. The most experienced audit teams treat this as a material weakness for in-scope ICFR controls.

The third treatment is new in 2026. Two years ago, a confidence-score-only audit trail might have been characterized as “documentation that should be improved.” In 2026, the regulators have closed the gap between “should be improved” and “is a finding.” For the auditor-question framing of the same shift, see what your SOX auditor will ask about your AI automation.

The architectural distinction: probabilistic vs deterministic reasoning

The reason confidence scores ever became a default audit artifact is that probabilistic AI systems don’t produce explanations natively. Their reasoning is an emergent property of model weights. To produce an “explanation,” the system has to either (a) post-rationalize the output by generating text that describes the decision after the fact, which is itself a probabilistic process and can hallucinate, or (b) report the confidence as a proxy for the missing explanation.

Neither approach satisfies the regulators. Post-rationalization can be wrong. Confidence as proxy answers the wrong question.

The architectural alternative is to ground the AI’s reasoning in explicit, inspectable rules that exist before the decision is made. The rule is the explanation. The decision is the execution of the rule against the input. The audit trail is the link between them. Confidence, if logged at all, is supporting information about input data quality or model agreement, not the explanation itself.

This is the difference between probabilistic AI as a tool and neurosymbolic, deterministic AI as a control. The first is what most enterprise AI looked like in 2024. The second is what the audit-ready architecture looks like in 2026.

In a deterministic, English-as-code system, the audit trail for an invoice approval looks like this:

“Approved per the 3-way match policy that states ‘an invoice matches a PO when the vendor name resolves to the same ERP record, the total agrees within the 2% tolerance, and the goods receipt is confirmed within the same fiscal period.’ Vendor: Acme Corp (ERP ID 4521). Total: $4,892 (PO: $4,850, variance: 0.87%). GR: confirmed 2026-04-15. Result: matched.”

That reads back in plain English. It cites the specific policy. It identifies the specific values. It links the decision to a reproducible chain. No confidence score appears, because none is needed.
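For readers who think in code, here is a rough sketch of how a record like the one above could be produced by an explicit rule. It is not Kognitos's implementation. The class, function, and fiscal-period dates are assumptions, while the 2% tolerance and the invoice values come from the example.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional

TOLERANCE = Decimal("0.02")  # the 2% tolerance from the example policy

@dataclass(frozen=True)
class MatchInput:
    invoice_vendor_id: str
    po_vendor_id: str
    invoice_total: Decimal
    po_total: Decimal
    goods_receipt: Optional[date]
    period_start: date
    period_end: date

def three_way_match(m):
    """Apply each clause of the policy and record its outcome."""
    variance = abs(m.invoice_total - m.po_total) / m.po_total
    checks = {
        "vendor resolves to the same ERP record":
            m.invoice_vendor_id == m.po_vendor_id,
        "total agrees within the 2% tolerance":
            variance <= TOLERANCE,
        "goods receipt confirmed within the fiscal period":
            m.goods_receipt is not None
            and m.period_start <= m.goods_receipt <= m.period_end,
    }
    return {
        "rule": "3-way match policy",
        "variance_pct": round(variance * 100, 2),
        "checks": checks,
        "result": "matched" if all(checks.values()) else "exception",
    }

audit = three_way_match(MatchInput(
    invoice_vendor_id="4521", po_vendor_id="4521",
    invoice_total=Decimal("4892"), po_total=Decimal("4850"),
    goods_receipt=date(2026, 4, 15),
    period_start=date(2026, 4, 1), period_end=date(2026, 6, 30),
))
print(audit["result"], audit["variance_pct"])
```

The same input always produces the same record, and each clause of the policy appears by name next to its outcome.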

How to evaluate this during a vendor pilot

If you are evaluating AI platforms in 2026 and want to know whether you’re buying audit-ready AI or confidence-score-only AI, run these four tests during the pilot. For a market-level view of the platforms that meet this bar today, see top AI platforms for automated reconciliation.

Test 1: Pick a decision, ask for the rule.

Select any decision the platform made during your pilot. Ask the vendor to produce, in plain language, the specific rule or policy that produced it. If the answer includes a confidence score but not a stated rule, you have your answer.

Test 2: Change an input, ask what changes.

Take an input the platform processed correctly. Change one element. Run it again. Ask the vendor’s tool to explain what changed. If the explanation is “confidence dropped from 0.94 to 0.78,” that’s not an explanation of what changed. It’s a measurement of how the model responded to what changed.
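By contrast, a rule-level explanation names the clause whose outcome changed. The sketch below assumes each run records its clause outcomes as a simple mapping; the clause names and the helper are hypothetical.

```python
def explain_change(before, after):
    """List every rule clause whose outcome differs between two runs."""
    label = {True: "passed", False: "failed"}
    return [
        f"{clause}: {label[before[clause]]} -> {label[after[clause]]}"
        for clause in before
        if before[clause] != after[clause]
    ]

# Hypothetical clause outcomes before and after raising the invoice total.
before = {
    "vendor resolves to the same ERP record": True,
    "total agrees within the 2% tolerance": True,
    "goods receipt confirmed": True,
}
after = {
    "vendor resolves to the same ERP record": True,
    "total agrees within the 2% tolerance": False,
    "goods receipt confirmed": True,
}

changes = explain_change(before, after)
print(changes)
```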

Test 3: Ask how the rule is version-controlled.

Confidence scores don’t have versions. Rules do. If the platform cannot show you when a specific rule changed, who changed it, what the diff was, and what the prior version said, the platform is treating its decision logic as opaque model state. Under PCAOB AS 2201, that matters because the auditor cannot rely on prior-year testing without evidence that the decision logic is unchanged, so opaque model state re-opens every operating effectiveness conclusion. The same discipline shows up in the AI Bill of Materials (AIBOM) your procurement team will start asking for.
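What versioned decision logic can look like in practice is sketched below, assuming rules are stored as text. The record layout, names, and dates are hypothetical. The hashing and diffing use Python's standard hashlib and difflib modules.

```python
import difflib
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class RuleVersion:
    version: int
    text: str
    changed_by: str
    changed_on: str  # ISO date

    @property
    def fingerprint(self):
        """Stable hash of the rule text; unchanged text gives an unchanged hash."""
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()

def rule_diff(old, new):
    """Line-by-line diff between two versions of a rule."""
    return list(difflib.unified_diff(
        old.text.splitlines(), new.text.splitlines(),
        fromfile=f"v{old.version}", tofile=f"v{new.version}", lineterm="",
    ))

v1 = RuleVersion(
    1,
    "vendor resolves to the same ERP record\n"
    "total agrees within the 2% tolerance",
    "j.doe", "2026-01-10",
)
v2 = RuleVersion(
    2,
    "vendor resolves to the same ERP record\n"
    "total agrees within the 3% tolerance",
    "a.smith", "2026-03-02",
)

print("\n".join(rule_diff(v1, v2)))
print("logic unchanged:", v1.fingerprint == v2.fingerprint)
```

Comparing fingerprints across periods gives a simple yes-or-no answer to whether the rule text changed, and the diff shows what changed when it did.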

Test 4: Show the audit trail to an auditor before you sign the contract.

If your relationship with your external auditor allows it, walk through a sample audit trail from the vendor’s platform during evaluation. Ask the auditor whether the sample would satisfy a walkthrough under your control environment. Audit firms in 2026 are increasingly willing to do this informally, because they would rather flag the gap during evaluation than during the integrated audit.

How Kognitos handles this

Kognitos is a deterministic, neurosymbolic AI platform built specifically around the question this post describes. Every Kognitos automation:

Is written in plain English. The matching policy, the exception logic, the approval criteria, and the posting rule are all expressed in English-as-code. The English the auditor reads in the walkthrough is the same English that runs in production.
Executes deterministically. Same input produces the same output every time. The specific rule that drove the decision is cited in the audit log, not just the match outcome.
Logs the four ISACA fields by default. Identity (authenticated user plus AI system identity), Data Lineage (every input with source attribution), Control State (the specific rule in force at the moment of the decision), Temporal Integrity (model and configuration version pinned and logged).
Treats confidence as supporting evidence, never as explanation. Where probabilistic inputs are involved (such as document extraction from unstructured invoices), the confidence on the input quality is logged alongside the deterministic rule that processed it. The rule is the explanation. The confidence is metadata.
Maps to SOX, COSO, PCAOB AS 2201, ECOA, GDPR Article 22, and EU AI Act Articles 13 and 86 by design. The architecture was built for the question regulators are now asking, not retrofitted to it.

Kognitos is SOC 2 Type II, HIPAA, GDPR, and ISO 27001 aligned, with ISO/IEC 42001 alignment work underway (see our Trust portal).

If you are preparing for a 2026 audit cycle and want to see what English-as-code audit trails look like in production, we’d be glad to walk through a working example on your highest-risk AI-touched process.


Last updated: May 2026. This article is intended for informational purposes and does not constitute legal, audit, or compliance advice. Regulatory requirements continue to evolve. Engage qualified counsel for guidance specific to your control environment and jurisdiction.

Frequently asked questions

Are AI confidence scores acceptable as audit evidence?

No, not on their own. A confidence score tells you how certain the AI model was about its output. It does not tell you why the output was correct, which rule the AI applied, or what specific reasoning the auditor needs to verify. Under SOX (per COSO’s February 2026 guidance), ECOA (per CFPB Circular 2023-03), GDPR Article 22, and EU AI Act Articles 13 and 86, audit evidence must include the specific reasoning or rule behind a decision, not just a probability score. Confidence scores can supplement an explanation as supporting metadata. They cannot replace one.

What is the difference between a confidence score and an audit trail?

A confidence score is a number representing how certain the AI model is about its output (typically between zero and one). An audit trail is a reconstructable chain from the input through the specific policy or rule applied to the output. The score is about the model’s internal state. The trail is about the path the decision traveled and why. An auditor’s question “why was this decision made” is answered by the trail, not the score. Most enterprise AI systems in 2026 produce only the score; audit-ready systems produce both, with the trail as the primary artifact.

Why is “94% confident” not a valid explanation under GDPR Article 22?

GDPR Article 22 grants individuals the right to meaningful information about the logic involved in automated decisions that significantly affect them. The European Data Protection Board has consistently interpreted “meaningful” as requiring an explanation a reasonable person can understand, expressed in terms of the specific factors that influenced the decision. A probability score (94%) describes the model’s certainty, not the decision logic. It does not allow the data subject to understand which factors were weighed, how their personal data was used, or whether the decision was made on a valid basis. Article 22 substantively requires the rule, not the certainty.

Does ECOA accept AI confidence scores as adverse action reasons?

No. The Equal Credit Opportunity Act requires creditors to provide specific principal reasons for adverse credit decisions. CFPB Circular 2023-03 explicitly addressed AI in credit decisions: the use of complex algorithms does not relieve the creditor of the obligation to disclose specific reasons. A confidence score is a measure of the model’s certainty, not a principal reason. Acceptable adverse action notices identify the specific factors (income, debt-to-income ratio, credit history, etc.) that drove the decision. “Our AI was 94% confident in declining your application” is not a defensible adverse action notice under ECOA.

What does COSO’s 2026 generative AI guidance say about audit trails?

COSO published “Achieving Effective Internal Control Over Generative AI” on February 23, 2026. The guidance requires that effective monitoring of AI-driven processes capture prompts, inputs, outputs, model and configuration versions, and evidence of human review, sufficient to reconstruct what the AI acted on and show that the control functioned as designed. The standard is reconstructable reasoning, not just decision outputs. Confidence scores can appear in the audit trail as supplementary information about output quality. They cannot, on their own, reconstruct what the AI acted on or demonstrate that the control functioned as designed.

Can probabilistic AI ever satisfy regulatory audit trail requirements?

Yes, but not by relying on confidence scores. Probabilistic AI can satisfy regulatory requirements when the system is engineered so that every decision is paired with an explicit, inspectable rule or policy that the AI applied to produce the output. The rule is the explanation. The probabilistic AI is the mechanism that selected which rule to apply, or that extracted the data the rule operates on. This is how deterministic, neurosymbolic AI platforms like Kognitos approach the problem. Pure probabilistic AI that produces only outputs and confidence scores, without a citeable rule layer, struggles to satisfy ECOA, GDPR Article 22, EU AI Act Articles 13 and 86, COSO’s 2026 guidance, or PCAOB AS 2201’s expanded benchmarking provision.

How is calibration different from explanation?

Calibration is the property of a confidence score matching the actual frequency of correctness. A well-calibrated model that outputs 94% confidence on a class of decisions is correct about 94% of the time on that class. Calibration is technically important for model evaluation and drift detection. It is not an explanation of why a specific decision was made. A model can be perfectly calibrated and still produce no usable audit trail. The two are different problems. Calibration tells you how trustworthy the confidence number is. Explanation tells you what rule produced the decision. Auditors need the second; engineers and ML teams need the first.
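To show what calibration measures, here is a small sketch that bins decisions by confidence and compares average confidence with observed accuracy in each bin, a common way to estimate expected calibration error. The sample data is invented, and the bin count is an arbitrary choice.

```python
def calibration_table(confidences, correct, n_bins=10):
    """Return (avg_confidence, accuracy, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((conf, ok))
    rows = []
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((avg_conf, accuracy, len(items)))
    return rows

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy across bins."""
    n = len(confidences)
    return sum(
        count / n * abs(avg_conf - accuracy)
        for avg_conf, accuracy, count in calibration_table(confidences, correct, n_bins)
    )

# Invented sample: ten decisions at 0.9 confidence, nine of them correct.
well_calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
# Same confidence, but only five correct.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(well_calibrated, overconfident)
```

Both inputs to the calculation are a score and a right-or-wrong label. The rule behind any individual decision never enters it, which is why a low calibration error still leaves the auditor's question unanswered.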

What is the ISACA AI audit trail framework?

ISACA’s May 2026 article “The AI Audit Trail: From AI Policy to AI Proof” defines four elements an AI audit trail must show: Identity (who or what initiated the request), Data Lineage (what data was retrieved, referenced, filtered, or denied), Control State (what policies, safeguards, and access controls were in force at the time), and Temporal Integrity (the specific model, configuration, and data snapshot active when the answer was produced). Confidence scores are not part of the four required elements. ISACA’s framework treats confidence as supporting metadata, not as the audit artifact itself.

Are auditors trained to spot confidence-score-only audit trails in 2026?

Increasingly, yes. Big Four firms in 2026 are training audit staff specifically on AI-touched controls, including how to evaluate AI audit trail completeness. The pattern most commonly cited as a deficiency is an audit trail that captures the model’s output and confidence but not the specific rule or policy that produced the output. With COSO’s February 2026 guidance and the SEC’s March 2026 dedicated SOX enforcement group, firms have stronger institutional reason to flag this gap during integrated audits rather than deferring it to management letter comments. Treat the audit trail as a procurement requirement, not a documentation cleanup task.

How do I tell if my AI platform produces real audit trails or just confidence scores?

Run these four tests. First, pick a specific decision and ask the vendor to produce, in plain language, the specific rule or policy that produced it. If the answer includes only a confidence score, the platform does not produce real audit trails. Second, change one input and ask what changed in the explanation. If the explanation reduces to “confidence dropped from 0.94 to 0.78,” that’s measurement of model response, not explanation of rule application. Third, ask how the rule is version-controlled, with timestamps, approvers, and diffs. Confidence scores have no version history; rules do. Fourth, show a sample audit trail to your external auditor during evaluation and ask whether it would satisfy a walkthrough under your control environment. If the auditor flags concerns, address them before signing the contract, not before the audit cycle.

What’s the architectural fix for the confidence score problem?

The architectural fix is to ground the AI’s reasoning in explicit, inspectable rules that exist before the decision is made, and to execute the rules deterministically. The rule is the explanation. The decision is the execution of the rule against the input. The audit trail is the link between them. Confidence can be logged as metadata about input quality or model agreement, but it does not substitute for the rule. This is the design philosophy of neurosymbolic AI platforms like Kognitos, where automations are written in plain English (English-as-code), executed deterministically, and produce audit trails that map directly to ECOA, GDPR Article 22, EU AI Act, COSO, and PCAOB requirements. Probabilistic AI platforms can approach this with structured prompt-and-policy frameworks, but the architectural starting point matters: it is easier to build audit-ready AI from deterministic foundations than to retrofit it onto probabilistic ones.
