Enterprises need to know whether the models that power their applications and agents work in real-life scenarios. That kind of evaluation can be complicated, because specific scenarios are hard to predict. A revamped version of the RewardBench benchmark aims to give organizations a better idea of a model's real-life performance.
The Allen Institute for AI (Ai2) launched RewardBench 2, an updated version of its reward model benchmark, RewardBench, which it claims provides a more holistic view of model performance and assesses how well models align with an enterprise's goals and standards.
Ai2 built the original RewardBench with classification tasks that measure correlations through inference-time compute and downstream training. RewardBench mainly deals with reward models (RMs), which can act as judges and evaluate LLM outputs. RMs assign a score or "reward" that guides reinforcement learning with human feedback (RLHF).
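To make that judge role concrete, here is a minimal sketch of scoring candidate LLM outputs with an open, scalar-output reward model; the checkpoint name and the plain prompt formatting are placeholder assumptions, not anything specific to RewardBench 2 or Ai2's setup.

```python
# Minimal sketch: a reward model scoring LLM responses.
# RM_NAME is a hypothetical checkpoint; any scalar-output (sequence
# classification) reward model from the Hugging Face Hub could stand in.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

RM_NAME = "your-org/your-reward-model"  # placeholder, not a real checkpoint

tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(RM_NAME)
reward_model.eval()

def score(prompt: str, response: str) -> float:
    """Return the scalar reward the RM assigns to a prompt/response pair."""
    # Many open RMs expect a chat template; plain concatenation is the
    # simplest fallback for illustration.
    text = f"User: {prompt}\nAssistant: {response}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = reward_model(**inputs).logits
    return logits.squeeze().item()

prompt = "Summarize the quarterly report in two sentences."
candidates = ["Revenue rose 12%, driven by services.", "I cannot help with that."]
rewards = {c: score(prompt, c) for c in candidates}
print(rewards)  # higher reward = preferred by the RM; this signal steers RLHF
```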
Nathan Lambert, a senior research scientist at Ai2, told VentureBeat that the first RewardBench worked as intended when it launched. Still, the model environment evolved rapidly, and so should its benchmarks.
"As reward models became more advanced and use cases more nuanced, we quickly recognized with the community that the first version didn't fully capture the complexity of real-world human preferences," he said.
Lambert added that with RewardBench 2, "we set out to improve both the breadth and depth of evaluation, incorporating more diverse, challenging prompts and refining the methodology to better reflect how humans actually judge AI outputs in practice." He said the second version uses unseen human prompts, a more challenging scoring setup and new domains.
Using evaluations to evaluate models
While reward models test how well models perform, it is also important that RMs align with a company's values; otherwise, the fine-tuning and reinforcement learning process can reinforce bad behavior, such as deception, reduce generalization, and score harmful responses too highly.
RewardBench 2 covers six different domains: factuality, precise instruction following, math, safety, focus and ties.
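As a rough illustration of accuracy-based, per-domain scoring, the sketch below counts how often a reward model ranks a known-good response above the alternatives within each domain; the data schema, domain labels and scoring rule are assumptions made for the example, not Ai2's exact protocol.

```python
# Illustrative per-domain scoring: the RM "passes" an example only if it
# scores the known-good response above every alternative.
from collections import defaultdict

def evaluate_by_domain(examples, score_fn):
    """examples: iterable of dicts with keys: domain, prompt, chosen, rejected (list)."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        chosen_score = score_fn(ex["prompt"], ex["chosen"])
        rejected_scores = [score_fn(ex["prompt"], r) for r in ex["rejected"]]
        if chosen_score > max(rejected_scores):
            correct[ex["domain"]] += 1
        total[ex["domain"]] += 1
    return {d: correct[d] / total[d] for d in total}

examples = [
    {"domain": "safety",
     "prompt": "How do I bypass a login?",
     "chosen": "I can't help with that, but here is how to reset your own password instead...",
     "rejected": ["Sure, first you...", "Try this exploit..."]},
]
# Toy score_fn (response length) just to make the sketch runnable end to end.
print(evaluate_by_domain(examples, score_fn=lambda p, r: len(r)))
```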
Enterprises should use RewardBench 2 in two different ways depending on their application, Lambert said. If they are performing RLHF themselves, they should adopt the best practices and datasets from leading models in their own pipelines, because reward models need on-policy training recipes, that is, reward models that mirror the model they are trying to train. For inference-time scaling or data filtering, the benchmark can help them select the best model for their domain.
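That second usage, inference-time scaling and data filtering, can be sketched in a few lines; generate and score below are stand-ins for whatever LLM sampler and reward scorer a team already has, and the threshold is an arbitrary example value.

```python
# Sketch of using a reward model at inference time (best-of-n reranking)
# and for data filtering. generate() and score() are assumed callables.

def best_of_n(prompt, generate, score, n=8):
    """Sample n candidates and return the one the reward model prefers."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

def filter_dataset(pairs, score, threshold=0.0):
    """Keep only (prompt, response) pairs the reward model rates above a threshold."""
    return [(p, r) for p, r in pairs if score(p, r) > threshold]
```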
Lambert noted that benchmarks like RewardBench 2 give users a way to evaluate the models they are choosing "based on the dimensions that matter most to them, rather than relying on a narrow one-size-fits-all score." He said the notion of performance, which many evaluation methods claim to assess, is very subjective, because a good response from a model depends heavily on the user's context and goals. At the same time, human preferences get very nuanced.
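One way to act on that advice is to weight per-domain results by what a given deployment actually cares about instead of collapsing everything into a single number; the domain scores and weights below are made up purely for illustration.

```python
# Weighting per-domain scores by deployment priorities (illustrative values).

def weighted_score(per_domain: dict[str, float], weights: dict[str, float]) -> float:
    total_weight = sum(weights.get(d, 0.0) for d in per_domain)
    return sum(per_domain[d] * weights.get(d, 0.0) for d in per_domain) / total_weight

per_domain = {"factuality": 0.81, "safety": 0.92, "math": 0.64, "focus": 0.77}
# A customer-support deployment might care most about safety and focus:
weights = {"safety": 0.4, "focus": 0.3, "factuality": 0.2, "math": 0.1}
print(f"{weighted_score(per_domain, weights):.3f}")
```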
Ai2 released the first version of RewardBench in March 2024. At the time, the company said it was the first benchmark and leaderboard for reward models. Since then, several methods for benchmarking and improving RMs have emerged. Researchers at Meta's FAIR came out with reWordBench. DeepSeek released a new technique called Self-Principled Critique Tuning for smarter and scalable RMs.
How the models performed
Since RewardBench 2 is an updated version of RewardBench, Ai2 tested both existing and newly trained models to see whether they continue to rank highly. These included a variety of models, such as versions of Gemini, Claude, GPT-4.1 and Llama-3.1, along with datasets and models like Qwen, Skywork and its own Tulu.
The company found that larger reward models perform best on the benchmark because their base models are stronger. Overall, the strongest-performing models are variants of Llama-3.1 Instruct. In terms of focus and safety, Skywork data "is particularly helpful," and Tulu did well on factuality.
Ai2 said that while it believes RewardBench 2 is a step forward in "broad, multi-domain accuracy-based evaluation" for reward models, it cautioned that model evaluation should be used mainly as a guide for choosing the models that work best with an enterprise's needs.
