The limits of traditional testing
If AI companies have been slow to respond to the growing failure of benchmarks, it's partly because the test-scoring approach has been so effective for so long.
One of the biggest early successes of contemporary AI was the ImageNet challenge, a kind of forerunner of today's benchmarks. Released in 2010 as an open challenge to researchers, the database held more than 3 million images for AI systems to classify into 1,000 different classes.
Crucially, the challenge was completely agnostic about methods, and any successful algorithm quickly gained credibility regardless of how it worked. When an algorithm called AlexNet broke through in 2012, using a then-unconventional form of GPU training, it became one of the foundational results of modern AI. Few would have guessed in advance that AlexNet's convolutional neural networks would be the secret to unlocking image recognition, but after its high score, no one dared dispute it.
A big part of what made the challenge so effective was that there was little practical difference between ImageNet's objective rating and the actual process of asking a computer to recognize an image. Even if there were disputes over methods, no one doubted that the highest-scoring model would have an advantage when deployed in a real image-recognition system.
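To make that scoring idea concrete, here is a minimal sketch of method-agnostic benchmark scoring. This is illustrative only, not ImageNet's actual evaluation harness, and the image IDs and class names are invented; the point is that the scorer sees nothing but predicted labels, so any algorithm can be graded the same way.

```python
# Minimal sketch of method-agnostic benchmark scoring (illustrative only,
# not ImageNet's real harness). The scorer compares predicted labels to
# human labels and never inspects how the model produced them.

def top1_accuracy(predictions: dict[str, str], ground_truth: dict[str, str]) -> float:
    """Fraction of images whose predicted class matches the human label."""
    correct = sum(
        1 for image_id, label in ground_truth.items()
        if predictions.get(image_id) == label
    )
    return correct / len(ground_truth)

# Hypothetical image IDs and class names, standing in for a 1,000-class label set.
ground_truth = {"img_001": "tabby_cat", "img_002": "pickup_truck", "img_003": "goldfinch"}
predictions = {"img_001": "tabby_cat", "img_002": "minivan", "img_003": "goldfinch"}

print(top1_accuracy(predictions, ground_truth))  # about 0.67, however the model works internally
```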
But in the 12 years since, AI researchers have applied the same method-agnostic approach to increasingly general tasks. SWE-Bench is commonly used as a proxy for broader coding ability, while other exam-based benchmarks often stand in for reasoning ability. That broad scope makes it difficult to be rigorous about what a specific benchmark measures, which in turn makes it difficult to use the results responsibly.
Where things break down
Anka Reuel, a PhD student who has been focusing on the benchmark problem as part of her research at Stanford, has become convinced that the evaluation problem is the result of this push toward generality. "We've moved from task-specific models to general-purpose models," Reuel says. "It's not about a single task anymore but a whole bunch of tasks, so evaluation becomes harder."
Like the University of Michigan's Jacobs, Reuel thinks "the main issue with benchmarks is validity, even more than the implementation," noting: "that's where a lot of things break down." For a task like coding, for example, it's nearly impossible to fold every possible scenario into your problem set. As a result, it's hard to tell whether a model is scoring better because it's more skilled at coding or because it has manipulated the problem set more effectively. And with so much pressure on developers to achieve record scores, shortcuts are hard to resist.
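One way evaluators try to probe the "skill or manipulation" question is to compare a model's score on the public problem set with its score on a freshly written, held-out set. The sketch below is a generic illustration of that check, not how SWE-Bench or any particular benchmark is actually run, and all function names and data structures are invented for the example.

```python
# Illustrative contamination check (not any benchmark's real harness):
# if a model does much better on published problems than on freshly written
# ones, its public score may reflect familiarity with the problem set
# rather than genuine skill.
from typing import Callable, Sequence, Tuple

Problem = Tuple[str, str]  # (task description, expected answer)

def pass_rate(solve: Callable[[str], str], problems: Sequence[Problem]) -> float:
    """Fraction of problems the model answers correctly."""
    return sum(solve(task) == expected for task, expected in problems) / len(problems)

def contamination_gap(solve: Callable[[str], str],
                      public_set: Sequence[Problem],
                      held_out_set: Sequence[Problem]) -> float:
    """Positive gap: the model does better on problems it may have seen before."""
    return pass_rate(solve, public_set) - pass_rate(solve, held_out_set)
```

A large positive gap does not prove a model gamed the benchmark, but it is one of the few signals evaluators have when the public problem set cannot cover every scenario.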
For developers, the hope is that success on lots of specific benchmarks will add up to a generally capable model. But the techniques of agentic AI mean that a single AI system can encompass a complex array of different models, making it hard to evaluate whether improvement on a specific task will generalize. "There are many more knobs you can turn," says Sayash Kapoor, a computer scientist at Princeton and a prominent critic of sloppy practices in the AI industry. "When it comes to agents, they have given up on the best practices for evaluation."