
There is no shortage of generative AI benchmarks designed to measure how well a given model performs useful enterprise tasks, from coding to instruction following to agentic web browsing and tool use. But many of these benchmarks share a major shortcoming: they measure an AI's ability to complete specific problems and applications, not how factual the model is in its outputs, that is, whether it produces objectively correct information grounded in real-world data, especially when that information is contained in imagery or graphics.
For industries where accuracy matters most – legal, finance, and medical – the lack of a standard way to measure factuality has been a significant blind spot.
That changes today: Google's FACTS team and its data science unit, Kaggle, have released the FACTS Benchmark Suite, a comprehensive evaluation framework designed to close this gap.
The accompanying research paper offers a more granular definition of the problem, splitting "factuality" into two distinct operational scenarios: contextual factuality (grounding answers in provided data) and world-knowledge factuality (retrieving information from memory or the web).
While the headline news is Gemini 3 Pro's top placement, the deeper story for builders is an industry-wide "factuality wall."
According to the initial results, none of the models – including Gemini 3 Pro, GPT-5, and Claude Opus 4.5 – managed to break 70 percent accuracy across the full suite. For tech leaders, the signal is clear: the era of "trust but verify" is far from over.
Breaking down the benchmark
The FACTS suite goes beyond simple question-and-answer. It consists of four separate tests, each simulating a different real-world failure mode that developers encounter in production (a rough scoring sketch follows the list):
Parametric Benchmark (internal knowledge): Can the model correctly answer trivia-style questions using only its training data?
Search Benchmark (tool use): Can the model effectively use a web search tool to retrieve and synthesize live information?
Multimodal Benchmark (vision): Can the model accurately interpret charts, diagrams, and images?
Grounding Benchmark V2 (context): Can the model adhere strictly to the source text provided?
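Concretely, each model gets four separate accuracy numbers that roll up into one headline figure. Here is a minimal sketch of that roll-up, assuming the composite is a plain unweighted average of the four sub-scores (consistent with the "average" label in the leaderboard below, though the paper may weight them differently); the numbers are placeholders, not published results:

```python
from statistics import mean

# Hypothetical per-benchmark accuracies for one model, expressed as percentages.
# These are placeholder values, not published results from the FACTS release.
sub_scores = {
    "parametric": 75.0,   # internal knowledge (trivia-style questions)
    "search": 80.0,       # tool use (live web retrieval and synthesis)
    "multimodal": 45.0,   # vision (charts, diagrams, images)
    "grounding": 65.0,    # adherence to provided source text
}

# Composite score, assuming the headline number is an unweighted average of
# the four sub-benchmarks.
composite = mean(sub_scores.values())
print(f"Composite FACTS-style score: {composite:.1f}%")  # 66.2%
```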
Google released 3,513 examples to the public, while Kaggle holds back a private set to prevent developers from training on the test data, a problem known as "contamination."
Leaderboard: A game of inches
An initial benchmark run puts Gemini 3 Pro in the lead with a composite score of 68.8 percent, followed by Gemini 2.5 Pro (62.1 percent) and OpenAI's GPT-5 (61.8 percent).
| Model | FACTS Score (average) | Search (tool use) | Multimodal (vision) |
|---|---|---|---|
| Gemini 3 Pro | 68.8 | 83.8 | 46.1 |
| Gemini 2.5 Pro | 62.1 | 63.9 | 46.9 |
| GPT-5 | 61.8 | 77.7 | 44.1 |
| Grok 4 | 53.6 | 75.3 | 25.7 |
| Claude Opus 4.5 | 51.3 | 73.2 | 39.2 |
Data sourced from the FACTS team's release.
For builders: the "Search" vs. "Parametric" gap
For developers building RAG (retrieval-augmented generation) systems, the Search benchmark is the most important metric.
The data show a wide gap between what a model "knows" (parametric) and what it can "look up" (search). For example, Gemini 3 Pro scored a high 83.8 percent on search tasks but only 76.4 percent on parametric tasks.
This validates the current enterprise architecture standard: don’t rely on a model’s internal memory for key facts.
If you're building an internal knowledge bot, these results show that tying your model to a search tool or vector database isn't optional; it's the only way to push accuracy toward acceptable levels.
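For illustration, here is a minimal sketch of that pattern, with a hypothetical `vector_store` and `llm` client standing in for whatever retrieval backend and model SDK you actually use; the key point is that the prompt constrains the model to the retrieved context rather than its parametric memory:

```python
def answer_with_retrieval(question: str, vector_store, llm, k: int = 4) -> str:
    """Retrieval-augmented answer: ground the model in fetched documents.

    `vector_store.search` and `llm.generate` are hypothetical interfaces;
    swap in your own search tool, vector DB client, and model SDK.
    """
    # 1. Retrieve the most relevant passages instead of trusting model memory.
    passages = vector_store.search(query=question, top_k=k)
    context = "\n\n".join(p.text for p in passages)

    # 2. Constrain the model to the retrieved context (contextual factuality).
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.generate(prompt)
```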
Multimodal warning
The most critical data point for product managers is performance on multimodal tasks. The scores here are universally low: even the category leader, Gemini 2.5 Pro, hits only 46.9 percent accuracy.
Benchmark tasks include reading charts, interpreting diagrams, and identifying objects in natural images. With accuracy below 50 percent across the board, multimodal AI is not yet ready for unsupervised data extraction.
Bottom line: if your product roadmap includes having AI automatically extract data from invoices or interpret financial charts without human-in-the-loop review, you are likely introducing significant error rates into your pipeline.
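One way to keep that human loop manageable is to gate on model-reported confidence: auto-accept only high-confidence extractions and queue everything else for review. The `Extraction` record, its confidence field, and the 0.9 threshold below are illustrative assumptions, not part of the FACTS release:

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    field: str         # e.g. "invoice_total" (hypothetical field name)
    value: str         # value the vision model read off the document
    confidence: float  # model-reported confidence in [0, 1] (hypothetical)

def route_extractions(extractions: list[Extraction], threshold: float = 0.9):
    """Split vision-model output into auto-accepted vs. human-review queues.

    The 0.9 threshold is an illustrative default; tune it against your own
    error tolerance. With benchmark accuracy below 50 percent, expect many
    fields to land in the review queue.
    """
    accepted, needs_review = [], []
    for item in extractions:
        (accepted if item.confidence >= threshold else needs_review).append(item)
    return accepted, needs_review
```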
Why it matters to your stack
The FACTS benchmark is likely to become a standard reference point for procurement. When evaluating models for enterprise use, technology leaders should look beyond the composite score and drill down into the specific sub-benchmarks that match their use case (see the sketch after the list below).
Building a customer support bot? Check the Grounding score to keep the bot anchored to your policy documents.
Building a research assistant? Prioritize search scores.
Building an image analysis tool? Proceed with extreme caution.
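As a rough illustration of that drill-down, here is a sketch that picks a model per use case from the leaderboard numbers published above; the use-case-to-sub-benchmark mapping is a simplification of this list, and only the columns reported in the table are included:

```python
# Sub-benchmark scores from the published leaderboard above (percent).
leaderboard = {
    "Gemini 3 Pro":    {"composite": 68.8, "search": 83.8, "multimodal": 46.1},
    "Gemini 2.5 Pro":  {"composite": 62.1, "search": 63.9, "multimodal": 46.9},
    "GPT-5":           {"composite": 61.8, "search": 77.7, "multimodal": 44.1},
    "Grok 4":          {"composite": 53.6, "search": 75.3, "multimodal": 25.7},
    "Claude Opus 4.5": {"composite": 51.3, "search": 73.2, "multimodal": 39.2},
}

# Simplified use-case -> sub-benchmark mapping (an assumption, not from the paper).
use_case_metric = {
    "research_assistant": "search",
    "image_analysis": "multimodal",
    "general": "composite",
}

def best_model_for(use_case: str) -> str:
    """Return the model with the highest score on the relevant sub-benchmark."""
    metric = use_case_metric[use_case]
    return max(leaderboard, key=lambda model: leaderboard[model][metric])

print(best_model_for("research_assistant"))  # Gemini 3 Pro (83.8 on search)
print(best_model_for("image_analysis"))      # Gemini 2.5 Pro (46.9 on multimodal)
```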
As the FACTS team noted in its release, "All evaluated models achieved an overall accuracy of less than 70 percent, leaving plenty of headroom for future improvements." For now, the message to the industry is clear: models are getting smarter, but they are not yet flawless. Design your systems with the assumption that, roughly one-third of the time, the raw model may simply be wrong.