As a tech reporter, I often get asked questions like “Is DeepSeek really better than ChatGPT?” or “Is Anthropic’s model any good?” If I don’t feel like turning the answer into an hour-long seminar, I usually give the diplomatic response: “They’re both solid in different ways.”
Most people asking aren’t defining “good” in any precise way, and that’s fair. It’s human to want to make sense of something new and seemingly powerful. But that simple question (“Is this model any good?”) is the everyday version of a much more complicated technical problem.
So far, the way we’ve tried to answer that question is through benchmarks: giving models a fixed set of questions and grading them on how well they answer. But much like exams such as the SAT (the test used in many US college admissions), these benchmarks don’t always reflect deeper abilities. Lately it feels as if a new AI model drops every week, and every time a company launches one, it arrives with fresh scores showing it beating its predecessors. On paper, everything appears to be improving all the time.
In practice, it isn’t so simple. Just as grinding for the SAT can boost your score without improving your critical thinking, models can be trained to ace benchmarks without actually getting smarter, as Russell Brandom described in his piece for us. As Andrej Karpathy, a veteran of OpenAI and Tesla’s AI team, recently put it, we are going through an evaluation crisis: our scoreboard for AI no longer reflects what we really want to measure.
Benchmarks have fallen behind for a few important reasons. First, the industry has learned to “teach to the test,” training AI models to score well rather than genuinely improve. Second, widespread data contamination means models may already have seen benchmark questions, or even the answers, in their training data. And finally, many benchmarks are simply maxed out. On popular tests such as SuperGLUE, models have already reached or passed 90% accuracy, so further gains feel more like statistical noise than meaningful improvement. At that point, the scores stop telling us anything useful. That’s especially true in high-skill domains like coding, reasoning, and advanced STEM problem-solving.
Still, a growing number of teams around the world are trying to tackle the AI evaluation crisis.
One result is a new benchmark called LiveCodeBench Pro. It draws its problems from international algorithmic olympiads, competitions for elite high school and university programmers in which participants solve challenging problems without external tools. Top AI models currently manage only about 53% at first pass on medium-difficulty problems and 0% on the hardest ones. These are tasks where human experts routinely excel.
Zihan Zheng, a junior at NYU and a world finalist in competitive coding, led the project that developed LiveCodeBench Pro with a team of olympiad medalists. They have published both the benchmark and a detailed study showing that top models such as GPT-4o mini and Google’s Gemini 2.5 perform at a level comparable to the top 10% of human competitors. Across the board, Zheng observed a pattern: AI excels at making plans and executing tasks, but it struggles with nuanced algorithmic reasoning. “It shows that AI is still far from matching the best human coders,” he says.
LiveCodeBench Pro may define a new upper bar. But what about the floor? Earlier this month, a group of researchers from multiple universities argued that LLM agents should be evaluated primarily on the basis of their riskiness, not just how well they perform. In real-world applications, especially with AI agents, unreliability, hallucination, and brittleness are ruinous. When money or safety is on the line, a single wrong move can spell disaster.
There are other new efforts to address the problem. Some benchmarks, like ARC-AGI, now keep part of their data set private to prevent AI models from being optimized excessively for the test, a problem known as “overfitting.” Meta’s Yann LeCun has created LiveBench, a dynamic benchmark in which the questions evolve every six months. The goal is to evaluate models not just on knowledge but on adaptability.
The Chinese benchmark project xbench, developed by HongShan Capital Group (formerly Sequoia Capital China), is another such effort. I just wrote a story about it. Xbench was initially built in 2022, shortly after ChatGPT’s launch, as an internal tool for evaluating models for investment research. Over time, the team expanded the system and brought in outside collaborators. It made part of its question set publicly available last week.
Xbench is notable for its dual-track design, which tries to bridge the gap between lab-based tests and real-world utility. The first track evaluates technical reasoning skills, testing a model’s STEM knowledge and its ability to carry out Chinese-language research. The second track aims to assess practical usefulness: how well a model performs on tasks in sectors like recruitment and marketing. For example, one task asks an agent to identify five qualified battery engineers; another has it match brands with relevant influencers from a pool of more than 800 creators.
The team behind xbench has big ambitions. It plans to expand the benchmark’s coverage into areas such as finance, law, and design, and to update the test set quarterly to keep it from going stale.
This is something I often wonder about, because a model that scores well doesn’t necessarily deliver an entertaining, informative, or creative experience. Most queries from everyday users probably aren’t rocket science. There is still not much research on how to effectively evaluate a model’s creativity, but I’d love to know which model would be best for creative writing or art projects.
Human preference testing has also emerged as an alternative to benchmarks. One increasingly popular platform is LMArena, which lets users submit questions and compare answers from different models side by side, then pick which one they like best. Still, this method has its flaws. Users sometimes reward the answer that sounds more flattering or agreeable, even when it’s wrong. That can incentivize “sweet-talking” models and skew results in favor of pandering.
AI researchers are beginning to realize, and admit, that the status quo of AI testing cannot continue. At the recent CVPR conference, NYU professor Saining Xie drew on historian James Carse’s notion of finite and infinite games to critique the hypercompetitive culture of AI research. An infinite game, he noted, is open-ended; the goal is to keep playing. But in AI, a dominant player often drops a big result, triggering a wave of follow-up papers chasing the same narrow topic. This race-to-publish culture puts enormous pressure on researchers and rewards speed over depth, short-term wins over long-term insight. “If academia chooses to play a finite game,” he warned, “it will lose everything.”
I found his framing powerful, and it may well apply to benchmarks too. So, do we really have a comprehensive scoreboard for how good a model is? Not really. Many dimensions (social, emotional, interdisciplinary) still evade evaluation. But the new wave of benchmarks hints at a shift. As the field keeps evolving, a little skepticism is probably healthy.
This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.