Gaps in First-Party and Third-Party AI Model Evaluations
AI accountability would be supported by more consistent and comprehensive model transparency
A group of researchers with the EvalEval Coalition recently published a new paper on arXiv, “Who Evaluates AI’s Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations,” presenting an analysis of how AI models are evaluated for social impacts. The analysis exposes gaps between evaluations run by model developers themselves and evaluations run by third parties, highlighting a need for transparency reporting standards and regulations.
The crux of the analysis is a comparison of 186 first-party reports, published by model developers as part of model releases, with 183 post-release evaluations run by various third parties. These reports were assessed based on the level of detail provided in evaluations of any of seven social impact dimensions identified by Solaiman et al. (2023). The seven dimensions assessed were Bias and Harm, Sensitive Content (e.g. outputting hate speech), Performance Disparity (e.g. unequal results across subpopulations), Environmental Costs and Emissions, Privacy and Data, Financial Costs, and Moderation Labor (e.g. working conditions of data annotators). The rating scale ranged from 0 (no evaluation present) to 3 (sufficient detail to understand and contextualize the evaluation), with 1 indicating a vague mention and 2 indicating concrete results but limited clarity on methods and context. All the ratings are available here.
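To make the scoring scheme concrete, here is a minimal sketch in Python of how one report’s ratings might be represented. The data structures, field names, and example values are hypothetical illustrations of the rubric described above, not the coalition’s released data format.

```python
from dataclasses import dataclass
from enum import IntEnum

class DetailLevel(IntEnum):
    """The 0-3 rubric for how thoroughly a dimension is evaluated."""
    ABSENT = 0        # no evaluation present
    VAGUE = 1         # vague mention only
    LIMITED = 2       # concrete results, limited clarity on methods/context
    SUFFICIENT = 3    # enough detail to understand and contextualize

DIMENSIONS = [
    "Bias and Harm",
    "Sensitive Content",
    "Performance Disparity",
    "Environmental Costs and Emissions",
    "Privacy and Data",
    "Financial Costs",
    "Moderation Labor",
]

@dataclass
class ReportRating:
    """Ratings for one evaluation report (first- or third-party)."""
    model: str
    party: str                       # "first" or "third"
    scores: dict[str, DetailLevel]   # one score per social impact dimension

# Hypothetical example record (illustrative values only)
example = ReportRating(
    model="example-model",
    party="first",
    scores={dim: DetailLevel.ABSENT for dim in DIMENSIONS} | {
        "Bias and Harm": DetailLevel.LIMITED,
        "Sensitive Content": DetailLevel.VAGUE,
    },
)
```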
The main take-away is that third-party evaluations were considerably more detailed, on average, than first-party evaluations (2.62 vs. 0.72 on the 0-3 scale). The implication is that the tech companies and other organizations training models are not releasing as much detail about their evaluations of social impacts as the third parties who run evaluations. The authors note that the most popular models from the US (and to a lesser extent China) tend to attract the most third-party evaluations, exposing a gap in evaluation of less popular models. They also note that certain impact types, such as data and content moderation impacts (as well as others like environmental impacts), are almost entirely absent from third-party evaluations, exposing the reality that third parties simply do not have access to the information they would need to properly evaluate certain issues.
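As a rough sketch of how such an aggregate comparison could be reproduced from rating records like the one above (building on the hypothetical ReportRating records in the earlier sketch, not the study’s actual data files), one could average scores by evaluator type and check per-dimension coverage:

```python
from collections import defaultdict
from statistics import mean

def average_by_party(ratings: list[ReportRating]) -> dict[str, float]:
    """Mean detail score (0-3) across all dimensions, per evaluator type."""
    by_party: dict[str, list[int]] = defaultdict(list)
    for r in ratings:
        by_party[r.party].extend(int(score) for score in r.scores.values())
    return {party: mean(scores) for party, scores in by_party.items()}

def coverage_by_dimension(ratings: list[ReportRating], party: str) -> dict[str, float]:
    """Share of a party's reports that evaluate each dimension at all (score > 0)."""
    reports = [r for r in ratings if r.party == party]
    return {
        dim: mean(1.0 if r.scores.get(dim, 0) > 0 else 0.0 for r in reports)
        for dim in DIMENSIONS
    }

# With the coalition's full set of ratings loaded, average_by_party would show
# the kind of gap reported in the paper (roughly 0.72 for first-party reports
# vs. 2.62 for third-party reports), and coverage_by_dimension(..., "third")
# would surface which dimensions third parties rarely evaluate at all.
```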
The take-aways for policy here seem pretty clear. First-party evaluations of models by model providers are insufficient when it comes to evaluations of social impacts. There is a fair bit of variance in the level of attention different models receive and in which dimensions of social impact are evaluated at all. Transparency standards are needed to provide more consistency and clearer expectations for which evaluations need to be run and how, and for which data needs to be disclosed so that third parties can cover more terrain with their evaluations. In addition, there need to be standards around which models demand a full evaluation. And the third-party evaluation ecosystem needs sufficient capacity for its coverage to be comprehensive. Advancing consistent transparency standards for AI models would support AI accountability by providing the information needed by different accountability forums.
References
Reuel A, Ghosh A, Chim J, et al. (2025) Who Evaluates AI’s Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations. arXiv. https://arxiv.org/abs/2511.05613
Solaiman I, Talat Z, Agnew W, et al. (2023) Evaluating the Social Impact of Generative AI Systems in Systems and Society. arXiv. https://arxiv.org/abs/2306.05949v2
