Singhal, the OpenAI health lead, notes that the company’s current GPT-5 series of models, which had not yet been released when the original HealthBench study was conducted, do a much better job of seeking out additional context than their predecessors. However, OpenAI’s own reporting shows that GPT-5.4, the current flagship, is actually worse at context-seeking than GPT-5.2, an older version.
Ideally, Bean says, health chatbots would be subjected to controlled tests with human users, as they were in his study, before being released to the public. That can be a heavy lift, especially given how fast the AI world moves and how long human studies take. Bean’s own study used GPT-4o, which came out about a year ago and is already outdated.
Earlier this month, Google released a study that met Bean’s standards. In the study, patients discussed medical concerns with the company’s Articulate Medical Intelligence Explorer (AMIE), a medical LLM chatbot not yet available to the public, before meeting with a human physician. Overall, AMIE’s assessments were as accurate as those of the physicians, and none of the discussions raised safety concerns for the researchers.
Despite the encouraging results, Google isn’t planning to release AMIE anytime soon. “While research has advanced, there are significant limitations that must be overcome before systems for diagnosis and treatment can be translated into the real world, including further research into equity, fairness and safety testing,” wrote Alan Karthikesalingam, a research scientist at Google DeepMind. Google recently revealed that Health100, a health platform it’s building in partnership with CVS, will include an AI assistant powered by its flagship Gemini models, though the tool likely won’t be used for diagnosis or treatment.
Rodman, who led the AMIE study with Karthikesalingam, doesn’t think such extensive, multi-year studies are necessarily the right approach for chatbots like ChatGPT Health and Copilot Health. “There are many reasons why the clinical trial paradigm doesn’t always work in generative AI,” he says. “And that’s where this benchmarking conversation comes in. Are there standards from a trusted third party that we can agree are meaningful, that labs can hold themselves to?”
The key phrase there is “third party.” No matter how extensively companies test their own products, it’s hard to fully trust their results. Third-party review not only brings impartiality; when many third parties are involved, it also helps prevent blind spots.
OpenAI’s Singhal says he favors external evaluation. “We do our best to support the community,” he says. “Part of why we rolled out HealthBench was really to give the community and other model developers an example of what a really good evaluation looks like.”
Given how expensive it is to develop a high-quality evaluation, he says, he doubts any individual academic lab will be able to afford what he calls “one evaluation to rule them all.” But he points to efforts by academic groups to bring existing and new evaluations together into comprehensive evaluation suites, such as Stanford’s MedHELM framework, which tests models on a variety of clinical tasks. Currently, OpenAI’s GPT-5 has the highest MedHELM score.