
Just a few weeks ago, Google launched its Gemini 3 model, claiming leadership on several AI benchmarks. But the challenge with vendor-supplied benchmarks is that they’re just that: vendor-supplied.
A new vendor-neutral evaluation from Prolific, however, also puts Gemini 3 at the top of the leaderboard. It is not based on a set of academic benchmarks. Rather, it is based on a set of real-world attributes that actual users and organizations care about.
Prolific was founded by researchers at Oxford University and provides high-quality, reliable human data to power rigorous research and ethical AI development. The company’s Humaine benchmark applies this approach by using representative human sampling and blind testing to rigorously compare AI models across a variety of user scenarios, measuring not only technical performance but also user trust, adaptability and communication style.
The latest Humaine evaluation put 26,000 users through blind tests of the models. In the evaluation, Gemini 3 Pro’s trust score rose 16% to 69%, the highest ever recorded. Gemini 3 now ranks first overall in trust, ethics and safety across demographic subgroups; its predecessor, Gemini 2.5 Pro, led only 16 percent of the time.
Overall, Gemini 3 ranks first in three evaluation categories: performance and reasoning, interaction and adaptability, and trust and safety. It lost only in communication style, where DeepSeek V3 topped user preference at 43%. The Humaine evaluation also showed that Gemini 3 performed consistently well across 22 different demographic user groups, including variations in age, gender, ethnicity and political orientation. The evaluation also found that users are now five times more likely to choose the model in a head-to-head blind comparison.
But the ranking matters less than why it won.
"It’s consistency across a very wide range of different use cases, and a personality and a style that appeals to a wide range of different user types," William Bradley, co-founder and CEO of Prolific, told VentureBeat. "Although in certain instances, other models are preferred either by small subgroups or by a particular discourse type, it is the breadth of knowledge and flexibility of the model across a range of different use cases and audience types that has allowed it to win this particular standard."
How blind testing reveals what academic benchmarks miss
Humaine’s methodology exposes the flaws in how the industry evaluates models. Users interact with two models simultaneously in a multi-turn conversation. They don’t know which vendors power each response. They discuss whatever topics are important to them, not predetermined test questions.
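To make that setup concrete, here is a minimal Python sketch of a blind, multi-turn pairwise trial. It is illustrative only, not Prolific’s actual harness; the helper callables (ask_model, get_user_turn, get_preference) are hypothetical stand-ins for the model APIs and the participant interface.

```python
import random
from dataclasses import dataclass, field


@dataclass
class BlindTrial:
    """One blind, multi-turn head-to-head comparison between two models."""
    model_left: str                  # real vendor name, hidden from the participant
    model_right: str
    transcript: list = field(default_factory=list)
    preference: str | None = None    # "A", "B", or "tie", as reported by the user


def run_blind_trial(model_x, model_y, ask_model, get_user_turn, get_preference):
    """Pair two models anonymously; the participant only ever sees 'A' and 'B'."""
    # Randomize which model sits on which side to avoid position bias.
    left, right = random.sample([model_x, model_y], 2)
    trial = BlindTrial(model_left=left, model_right=right)

    # The user picks the topics; there are no predetermined test questions.
    while (prompt := get_user_turn(trial.transcript)) is not None:
        reply_a = ask_model(left, prompt)
        reply_b = ask_model(right, prompt)
        trial.transcript.append({"prompt": prompt, "A": reply_a, "B": reply_b})

    # Preference is collected only after the blinded conversation ends.
    trial.preference = get_preference(trial.transcript)
    return trial
```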
The sampling itself is what matters. Humaine uses representative samples of the US and UK populations, controlling for age, sex, race and political orientation. This reveals something that static benchmarks cannot: model performance varies by audience.
"If you take an AI leaderboard, the majority of them can still have a fairly stable list," Bradley said. "But for us, if you control for audience, we end up with a slightly different leaderboard, whether you’re looking at left-leaning samples, right-leaning samples, US, UK. And I think age was actually the most different described condition in our experience."
For businesses deploying AI across diverse employee populations, this makes a difference. A model that performs well for one demographic may perform poorly for another.
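A rough sketch of how such audience-controlled rankings could be derived from blind trial results is below; the trial schema is an assumption for illustration, not Humaine’s actual data format.

```python
from collections import defaultdict


def segment_win_rates(trials):
    """Aggregate blind pairwise preferences into per-audience win rates.

    Each trial is assumed to look like (illustrative schema only):
      {"models": ("model-x", "model-y"),
       "preferred": "model-x",          # or None for a tie
       "user": {"age_band": "25-34", "country": "UK", "politics": "left"}}

    Returns {(facet, value): {model: win_rate}}, so the leaderboard can be
    re-sorted for any audience segment (age band, country, politics, ...).
    """
    wins = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))

    for trial in trials:
        for facet, value in trial["user"].items():
            segment = (facet, value)
            for model in trial["models"]:
                totals[segment][model] += 1          # appearances in this segment
            if trial["preferred"] is not None:
                wins[segment][trial["preferred"]] += 1

    return {
        segment: {m: wins[segment][m] / totals[segment][m] for m in totals[segment]}
        for segment in totals
    }
```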
The methodology also addresses a fundamental question in AI evaluation: Why use human judges when AI can evaluate itself? Bradley noted that his firm uses AI judges in some use cases, though he emphasized that human evaluation is still an important factor.
"We see the greatest benefit coming from the smart orchestration of both LLM judge and human data, both having strengths and weaknesses that, when intelligently combined, perform better together." Bradley said. "But we still think that human data is where the alpha is. We are still very fast that human data and human intelligence need to stay in the loop."
What trust means in AI evaluation
The trust, ethics and safety category measures users’ confidence in a model’s reliability, factual accuracy and responsible behavior. In Humaine’s methodology, trust is not a vendor claim or a technical metric; it is what users report after blind conversations with competing models.
The 69% figure held up across demographic groups. That consistency is more important than the overall score, because organizations may serve a diverse population.
"There was no awareness that they were using Gemini in this scenario," Bradley said. "It was based on blinded multi-turn responses only."
This distinguishes perceived trust from earned trust. Users judged model outputs without knowing which vendor produced them, eliminating Google’s brand advantage. For customer-facing deployments where the AI vendor remains invisible to end users, this distinction matters.
What should businesses do now?
The most important thing businesses can do now when weighing different models is to adopt an evaluation framework that actually works.
"Web-based models are generally difficult to evaluate," Bradley said. "I think increasingly we need a more rigorous, scientific approach to understanding how these models are performing."
The Humaine data points to a framework: test for consistency across use cases and user demographics, not just performance on specific tasks. Blind the tests to separate model quality from brand perception. Use representative samples that match your actual user population. And plan for continuous evaluation as models are replaced.
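One simple way to operationalize the consistency criterion, sketched under the assumption that per-segment win rates are available (for example, from a function like segment_win_rates above), is a maximin rule that selects the model with the best worst-case audience score.

```python
def most_consistent_model(segment_scores):
    """Pick the model with the best worst-case win rate across audience segments.

    segment_scores: {segment: {model: win_rate}}, e.g. the output of
    segment_win_rates(). A maximin rule is one reasonable policy among many:
    it rewards models that hold up for every audience, not just on average.
    """
    floors = {}
    for scores in segment_scores.values():
        for model, rate in scores.items():
            floors[model] = min(rate, floors.get(model, 1.0))  # worst segment so far
    return max(floors, key=floors.get)
```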
For enterprises looking to deploy AI at scale, this means moving beyond "Which model is best?" to "Which model is best for our specific use case, user demographics and desired attributes?"
The rigor of representative sampling and blind testing provides the data to make this determination.