LLMs Can’t Provide Faithful Explanations Needed for AI Accountability
A growing body of research shows that the explanations produced by LLMs often do not accurately reflect how those models actually compute their outputs. In the literature this property is referred to as explanation faithfulness (Agarwal et al, 2024; Jacovi and Goldberg, 2020), and measuring it accurately is an area of active research (Lyu et al, 2024). Agarwal and colleagues (2024) articulate it as: “An explanation is considered faithful if it accurately represents the reasoning of the underlying model.” A less anthropomorphic way of talking about “reasoning” here is to say that an explanation is faithful if it accurately describes how the system or model processes an input into an output. Some explanations may be more faithful than others (Jacovi and Goldberg, 2020), and certain interpretable models are able to produce more faithful explanations than black-box models (Rudin, 2019).
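To make the input-to-output framing concrete, the following sketch shows one simple way a faithfulness claim can be probed: a perturbation test. The model, features, and explanation here are hypothetical toy examples, not drawn from the cited papers; the idea is only that if an explanation claims a feature drove the output, changing that feature should be able to change the output, while changing features the model ignores should not.

```python
def model(x):
    # Hypothetical black-box scorer: in reality only x["a"] affects the output.
    return 1 if x["a"] > 0.5 else 0

def supports_claim(model, x, claimed_feature, probe_values):
    """Return True if perturbing the claimed feature can change the model's output."""
    baseline = model(x)
    return any(model({**x, claimed_feature: v}) != baseline for v in probe_values)

x = {"a": 0.9, "b": 0.1}
print(supports_claim(model, x, "a", [0.0, 1.0]))  # True: an explanation citing "a" is consistent with behavior
print(supports_claim(model, x, "b", [0.0, 1.0]))  # False: an explanation citing "b" would be unfaithful
```

Real faithfulness evaluations (e.g. those surveyed by Lyu et al, 2024) are far more involved, but they share this basic logic of checking an explanation against the model's actual input-output behavior.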
Explanations rendered by and about AI systems need to be as faithful as epistemically possible in order to support accountability. Buijsman (2026) describes the role of explanation in supporting accountability: “when a mistake has been made, the challenge is to find a reason why that mistake happened and the people responsible for fixing it.” A faithful explanation can help determine whether there is an issue with faulty data, missing information, or incorrect reasoning, and ultimately help improve the system over time. Explanations that are not faithful can misdirect decisions about how to assign blame or prevent future harms; frustrate attempts to contest a decision or to diagnose mistakes and logical errors so they can be corrected; and ultimately undermine the ability to appropriately sanction actors when an explanation proves unacceptable.
Faithfulness is especially relevant to questions of process accountability, where the goal is to hold an actor in the AI system accountable for how an outcome was computed. Explanations are a diagnostic tool for accountability: they describe how inputs lead to the outcome and help trace instances of potential negligence or faulty logic in the system. If an unfaithful explanation of a mortgage decision says that you were rejected because your income is too low, but the model's decision was actually influenced by your race or zip code, this undermines your ability to challenge the decision as unacceptably relying on protected characteristics.
LLMs are not able to provide faithful self-explanations, that is, explanations generated by the model itself to render the “reasoning” behind its output in human-understandable language (Madsen et al, 2024; Matton et al, 2025; Mayne et al, 2025). Madsen and colleagues (2024) show that larger models with more parameters generally produce more faithful explanations, but with high variance across tasks. Mayne and colleagues (2025) focus on self-generated counterfactual explanations (SCEs) and report that their findings “suggest that SCEs are, at best, an ineffective explainability tool and, at worst, can provide misleading insights into model behaviour.” While models may be able to provide counterfactual explanations (e.g. if you change variables X and Y, the decision outcome will flip), these may be trivially true rather than articulating the minimal changes to the input that would actually shed light on the decision.
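The distinction between a trivially true and a minimal counterfactual can be sketched concretely. The decision rule, applicant fields, and thresholds below are hypothetical and purely illustrative, not taken from the cited studies: both counterfactuals flip the decision, but only the minimal one isolates what mattered.

```python
def approve(applicant):
    # Hypothetical decision rule: approve when income minus debt clears a
    # threshold. The zip_code field does not affect the outcome at all.
    return applicant["income"] - applicant["debt"] >= 30_000

original = {"income": 40_000, "debt": 25_000, "zip_code": "60601"}

# Trivial counterfactual: change many variables at once. It flips the
# decision, so it is "true", but it does not isolate what drove the outcome.
trivial = {"income": 90_000, "debt": 0, "zip_code": "10001"}

# Minimal counterfactual: the smallest change that flips the decision,
# which actually sheds light on the decision boundary.
minimal = dict(original, income=55_000)

def n_changed(a, b):
    # Count how many fields differ between two applicant records.
    return sum(a[k] != b[k] for k in a)

assert not approve(original)
assert approve(trivial) and approve(minimal)
print(n_changed(original, trivial))   # 3 fields changed
print(n_changed(original, minimal))   # 1 field changed
```

A minimal counterfactual tells the applicant precisely what would need to change; a trivial one is consistent with the model yet uninformative, which is one way a "valid" explanation can still fail to support contestation.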
The main implication is that in situations where accountability matters, such as high-stakes contexts with potential for severe impacts, faithful explanations are critical, yet LLMs cannot provide them. Policymakers should consider when AI providers need to demonstrate the faithfulness of model explanations, and establish thresholds governing when models can be used in high-stakes contexts. Administrative bodies will also need to develop standardized benchmarks and measurements for faithfulness to support such policies.
References
Agarwal C, Tanneru SH and Lakkaraju H (2024) Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models. arXiv. DOI: 10.48550/arXiv.2402.04614.
Buijsman S (2026) Accuracy is not all you need! The Reasons to Require AI Explainability. Minds and Machines 36(1): 14.
Jacovi A and Goldberg Y (2020) Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 4198–4205. DOI: 10.18653/v1/2020.acl-main.386.
Lyu Q, Apidianaki M and Callison-Burch C (2024) Towards Faithful Model Explanation in NLP: A Survey. Computational Linguistics 50(2): 657–723.
Madsen A, Chandar S and Reddy S (2024) Are self-explanations from Large Language Models faithful? In: Findings of the Association for Computational Linguistics: ACL, 2024.
Matton K, Ness RO, Guttag J, et al. (2025) Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations. In: ICLR, 2025.
Mayne H, Kearns RO, Yang Y, et al. (2025) LLMs Don’t Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations. In: EMNLP, 2025.
Rudin C (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1(5): 206–215.
