Development of the benchmark at HongShan began in 2022, following ChatGPT’s breakout success, as an internal tool for evaluating which models are worth investing in. Since then, led by partner Gong Yuan, the team has steadily expanded the system, bringing in outside researchers and professionals to help refine it. As the project grew more sophisticated, they decided to release it to the public.
Xbench approaches the problem with two different systems. One is similar to traditional benchmarking: an academic test that assesses a model’s aptitude on various subjects. The other is more like a technical interview round for a job, gauging how much real-world economic value a model can deliver.
Xbench’s methods for evaluating raw intelligence currently include two components: Xbench-ScienceQA and Xbench-DeepResearch. ScienceQA is not a radical departure from existing postgraduate-level STEM benchmarks such as GPQA and SuperGPQA. It includes questions spanning fields from biochemistry to orbital mechanics, drafted by graduate students and double-checked by professors. Scoring rewards not only the correct answer but also the reasoning that leads to it.
DeepResearch, by contrast, focuses on a model’s ability to navigate the Chinese-language web. Ten subject-matter experts created 100 questions in music, history, finance, and literature. Scoring favors breadth of sources, factual consistency, and a model’s willingness to admit when there isn’t enough data. One question in the publicized collection is, “How many Chinese cities in the three northwestern provinces border a foreign country?” (It’s 12, and if you are wondering, only 33% of the models tested got it right.)
On the company’s website, the researchers said they want to add more dimensions to the test, for example, how creative a model is in its problem-solving, how collaborative it is when working with other models, and how reliable it is.
The team has committed to updating the test questions once a quarter and to maintaining a half-public, half-private data set.
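How the team actually divides its questions is not described. As a rough illustration only, one simple way to produce a repeatable half-public, half-private split is to rank question IDs by a salted hash and cut the list in the middle. The function name `split_half`, the question IDs, and the quarter label below are all hypothetical, not anything published by Xbench.

```python
import hashlib


def split_half(question_ids, quarter_label):
    """Split question IDs into (public, private) halves.

    Hypothetical sketch: IDs are ranked by a SHA-256 hash salted with a
    quarter label, so the split is repeatable within a quarter and can be
    reshuffled by changing the label when questions are refreshed.
    """
    ranked = sorted(
        question_ids,
        key=lambda qid: hashlib.sha256(
            f"{quarter_label}:{qid}".encode("utf-8")
        ).hexdigest(),
    )
    midpoint = len(ranked) // 2
    return ranked[:midpoint], ranked[midpoint:]


# Illustrative IDs only; the real question identifiers are not known here.
question_ids = [f"q{i:03d}" for i in range(100)]
public_ids, private_ids = split_half(question_ids, "hypothetical-quarter-1")
```

With 100 questions, this yields 50 public and 50 private IDs, and rerunning it with the same label reproduces the same halves.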
To assess models’ real-world readiness, the team worked with experts to develop tasks modeled on actual workflows, initially in recruitment and marketing. For example, one task asks a model to source five qualified battery engineer candidates and justify each choice. Another asks it to match advertisers with suitable short-video creators from a pool of more than 800 influencers.
The website also teases upcoming categories, including finance, legal, accounting, and design. The question sets for these categories have not yet been open-sourced.
ChatGPT-o3 ranks first in the current professional categories. For recruiting, Perplexity Search and Claude 3.5 Sonnet take second and third place, respectively. For marketing, Claude, Grok, and Gemini all perform well.
“It is really difficult for benchmarks to include things that are so hard to quantify,” says a researcher behind a new benchmark called LiveCodeBench Pro, who is a student at NYU. “But Xbench represents a promising start.”