In organizations where this approach is beginning to take hold, the first step is to change the unit of analysis.
For example, in a UK hospital system between 2021 and 2024, the question of whether medical AI applications improve diagnostic accuracy was broadened: does the presence of AI within hospital multidisciplinary teams affect not only accuracy but also coordination and deliberation? The hospital explicitly compared coordination and deliberation in teams that used AI with teams that did not. Multiple stakeholders, inside and outside the hospital, rated the deployment on metrics such as how AI affects collective reasoning, whether it surfaces overlooked concerns, whether it strengthens or weakens coordination, and whether it changes established risk and compliance practices.
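To make the shift in the unit of analysis concrete, here is a minimal sketch in Python of how such team-level ratings might be compared across teams that do and do not use AI. The metric names, rating scale, and numbers are all hypothetical; the hospital's actual instruments are not described in this piece.

```python
from statistics import mean

# Hypothetical stakeholder ratings (1-5) for teams with and without AI.
# Metric names mirror the dimensions described above; the values are
# illustrative, not the hospital's actual data.
ratings = {
    "with_ai": {
        "collective_reasoning": [4, 3, 4, 5],
        "overlooked_concerns_surfaced": [4, 4, 3, 4],
        "coordination": [2, 3, 2, 3],
        "risk_compliance_practices": [3, 3, 4, 3],
    },
    "without_ai": {
        "collective_reasoning": [3, 3, 4, 3],
        "overlooked_concerns_surfaced": [2, 3, 3, 2],
        "coordination": [4, 4, 3, 4],
        "risk_compliance_practices": [3, 4, 3, 3],
    },
}

# Compare mean ratings per metric. The unit of analysis is the team,
# not the model: each score describes how the team behaves, not how
# accurate the AI is on a narrow task.
for metric in ratings["with_ai"]:
    with_ai = mean(ratings["with_ai"][metric])
    without_ai = mean(ratings["without_ai"][metric])
    print(f"{metric}: with AI {with_ai:.2f} vs without {without_ai:.2f} "
          f"(delta {with_ai - without_ai:+.2f})")
```

Note that in a sketch like this, AI can lift one team-level dimension (say, surfacing overlooked concerns) while depressing another (coordination); a model-only benchmark would register neither effect.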
This shift is fundamental, especially in high-stakes settings where system-level effects matter more than task-level accuracy. It also matters for the economy: it may help reframe the growing expectations of mass productivity gains, which have so far been predicated on the promise of improving individual work efficiency at scale.
Once this foundation is established, HAIC benchmarking can begin to incorporate the dimension of time.
Today’s benchmarks resemble school exams: one-off, standardized tests of accuracy. But true professional competence is assessed differently. Junior doctors and lawyers are evaluated continuously within the actual workflow, under supervision, with feedback loops and accountability structures. Performance is assessed over time and within a specific context, because competence is relative to context. If AI systems are to work alongside professionals, their impact must be tested longitudinally, reflecting how performance unfolds over repeated interactions.
I saw this aspect of HAIC in one of my humanitarian sector case studies. Over the course of 18 months, an AI system was evaluated within an actual workflow, with particular attention to how detectable its errors were, that is, how easily human teams could identify and correct them. This long-term “error detection record” meant that the organizations involved could design and test context-specific guardrails that promote confidence in the system, despite the inevitability of occasional AI errors.
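As an illustration only, the following sketch shows how such an error detection record could be tracked over a long evaluation: for each review period, what share of the AI’s errors the human team caught. The schema and the numbers are hypothetical, not the case study’s actual data.

```python
from collections import defaultdict

# Hypothetical log entries from a long-running evaluation: each record
# notes the quarter, whether the AI output was erroneous, and whether
# the human team caught the error. Schema and values are illustrative.
log = [
    {"quarter": "2023-Q1", "ai_error": True,  "caught_by_team": False},
    {"quarter": "2023-Q1", "ai_error": True,  "caught_by_team": True},
    {"quarter": "2023-Q2", "ai_error": True,  "caught_by_team": True},
    {"quarter": "2023-Q2", "ai_error": False, "caught_by_team": False},
    {"quarter": "2023-Q3", "ai_error": True,  "caught_by_team": True},
    {"quarter": "2023-Q3", "ai_error": True,  "caught_by_team": True},
]

# Aggregate per quarter: of the AI's errors, how many did the team detect?
errors = defaultdict(int)
caught = defaultdict(int)
for entry in log:
    if entry["ai_error"]:
        errors[entry["quarter"]] += 1
        if entry["caught_by_team"]:
            caught[entry["quarter"]] += 1

# A rising detection rate over successive quarters suggests guardrails
# and team familiarity are working; a flat or falling rate flags errors
# that evade human review and need new safeguards.
for quarter in sorted(errors):
    rate = caught[quarter] / errors[quarter]
    print(f"{quarter}: {caught[quarter]}/{errors[quarter]} errors caught "
          f"({rate:.0%})")
```

The point of tracking the rate over time, rather than a single snapshot, is exactly the longitudinal one: it reveals whether human oversight improves, stagnates, or erodes as the system beds in.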
A longer-term horizon also captures system-level outcomes that short-term benchmarks miss. An AI application may outperform a physician on a narrow diagnostic task yet fail to improve multidisciplinary decision-making. Worse, it can introduce systemic distortions: anchoring teams too early on plausible but incomplete answers, increasing people’s cognitive workload, or creating downstream failures that negate any speed or efficiency gains the AI brings. These knock-on effects, often invisible to current benchmarks, are central to understanding AI’s real-world impact.
The HAIC approach, admittedly, makes benchmarking more complex, resource-intensive, and difficult to standardize. But continuing to evaluate AI in neat isolation from the world of work will lead us to misunderstand what it can and cannot really do for us. To deploy AI responsibly in real-world settings, we must measure what really matters: not just what the model can do alone, but what it enables, or undermines, when humans and teams work with it.
Angela Aristidou is a professor at University College London and a faculty fellow at the Stanford Digital Economy Lab and the Stanford Institute for Human-Centered Artificial Intelligence. She speaks, writes, and advises on the real-world deployment of artificial intelligence tools for the public good.