
The intelligence of AI models is not what is blocking enterprise deployments. It is the inability to define and measure quality in the first place.
This is where AI judges are now playing an increasingly important role. In AI evaluation, a "judge" is an AI system that evaluates the outputs of another AI system.
Judge Builder is Databricks' framework for building judges; it was first deployed as part of the company's Agent Bricks technology earlier this year. The framework has evolved significantly since its initial launch in response to direct user feedback and deployments.
Early versions focused on technical implementation, but customer feedback revealed that the real bottleneck was organizational alignment. Databricks now offers a structured workshop process that guides teams through three primary challenges: getting stakeholders to agree on quality standards, capturing domain expertise from a limited pool of subject matter experts and deploying evaluation systems at scale.
"Model intelligence is usually not a constraint, models are really smart," Jonathan Frankel, chief AI scientist at Databricks, told VentureBeat in an exclusive briefing. "Instead, it’s really about asking, how do we get the model to do what we want, and how do we know if it’s what we want?"
The ‘Ouroboros Problem’ of AI Evaluation
Judge Builder addresses what Pallavi Koppol, the Databricks research scientist who led its development, calls "the Ouroboros problem." An ouroboros is an ancient symbol depicting a snake eating its own tail.
Using AI systems to evaluate other AI systems creates a circular validation challenge.
"You want to judge whether your system is good, if your AI system is good, but then your judge is also an AI system," Coppola explained. "And now you’re saying, well, how do I know if this judge is good?"
The solution is measuring "human expert ground truth distance" as the primary scoring function. By reducing the gap between how an AI judge scores outputs and how domain experts score them, organizations can rely on these judges as scalable proxies for human evaluation.
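In practice, that gap can be quantified with something as simple as the average distance between judge scores and expert scores on a shared set of examples. The sketch below is illustrative only and assumes both sides rate outputs on the same 1-5 scale; it is not Databricks' implementation.

```python
# Minimal sketch (not Databricks' actual method): score a judge by how far its
# ratings sit from human expert ratings on the same set of outputs.
from statistics import mean

def expert_ground_truth_distance(judge_scores: list[int], expert_scores: list[int]) -> dict:
    """Compare an AI judge's ratings against human expert ratings for the same outputs."""
    assert len(judge_scores) == len(expert_scores), "scores must cover the same examples"
    gaps = [abs(j - e) for j, e in zip(judge_scores, expert_scores)]
    return {
        "mean_absolute_distance": mean(gaps),  # lower = judge is closer to the experts
        "exact_agreement_rate": mean(j == e for j, e in zip(judge_scores, expert_scores)),
    }

# A judge that tracks experts closely can stand in for them at scale.
print(expert_ground_truth_distance([5, 4, 2, 1, 3], [5, 4, 3, 1, 3]))
```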
This approach is fundamentally different from traditional guardrail systems or single-metric assessments. Instead of asking whether an AI output passed or failed a generic quality check, Judge Builder develops highly specific evaluation criteria tailored to each organization's domain expertise and business needs.
The technical implementation also sets it apart. Judge Builder integrates with Databricks' MLflow and prompt optimization tools and can work with any underlying model. Teams retain control of their judges, can track performance over time and can assign multiple judges to different quality dimensions simultaneously.
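Conceptually, each judge is a narrowly scoped, rubric-prompted model call whose scores can be tracked over time. The sketch below illustrates that shape in plain Python; `call_model` is a hypothetical stand-in for whatever foundation model API a team uses, not an MLflow or Databricks call.

```python
# Illustrative sketch only: a rubric-driven LLM judge for one quality dimension.
from dataclasses import dataclass, field

def call_model(prompt: str) -> str:
    # Stand-in for a real foundation-model call; replace with your provider's SDK.
    return "4"

@dataclass
class RubricJudge:
    name: str
    rubric: str                                        # criteria written with domain experts
    history: list[int] = field(default_factory=list)   # track scores over time

    def score(self, question: str, answer: str) -> int:
        prompt = (
            f"You are grading one quality dimension: {self.name}.\n"
            f"Rubric:\n{self.rubric}\n\n"
            f"Question: {question}\nAnswer: {answer}\n"
            "Reply with a single integer score from 1 to 5."
        )
        rating = int(call_model(prompt).strip())
        self.history.append(rating)
        return rating

tone_judge = RubricJudge(
    name="tone",
    rubric="Score 5 if the reply is professional and empathetic; score 1 if it is curt or dismissive.",
)
print(tone_judge.score("Where is my refund?", "We've escalated your refund and will update you within 24 hours."))
```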
Lessons Learned: Building Judges That Actually Work
Databricks’ work with enterprise customers revealed three key lessons that apply to everyone building AI judges.
Lesson One: Your experts don't agree as much as you think. When quality is subjective, organizations find that even their own subject matter experts do not agree on what constitutes acceptable output. A customer service response may be factually correct but use an inappropriate tone. A financial summary may be accurate but too technical for the intended audience.
"The biggest lesson of this whole process is that all problems become people’s problems." Frankel said. "The hardest part is getting an idea out of a person’s mind and into something tangible. And the hard part is that companies are not one brain, but many brains."
The fix is batch annotation with inter-rater reliability checks. Teams annotate examples in small batches, then measure agreement scores before moving on. This surfaces misunderstandings quickly. In one case, three experts gave the same output ratings of 1, 5, and neutral before discussion revealed that they were interpreting the evaluation criteria differently.
Companies using this approach achieve inter-rater reliability scores above 0.6, compared with the roughly 0.3 typical of external annotation services. Higher agreement translates directly into better judge performance because there is less noise in the training data.
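The article does not specify which reliability statistic Databricks uses, but Fleiss' kappa is one common choice when several raters score the same items; the sketch below is an illustration under that assumption, not their method.

```python
# Hedged sketch: Fleiss' kappa as one way to compute inter-rater reliability.
from collections import Counter

def fleiss_kappa(ratings: list[list[int]]) -> float:
    """ratings[i] = the category each rater assigned to item i (same rater count per item)."""
    n_items, n_raters = len(ratings), len(ratings[0])
    category_totals = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # How often pairs of raters agree on this item.
        per_item_agreement.append(
            (sum(c * c for c in counts.values()) - n_raters) / (n_raters * (n_raters - 1))
        )
    p_bar = sum(per_item_agreement) / n_items
    # Chance agreement from the overall category distribution.
    p_e = sum((total / (n_items * n_raters)) ** 2 for total in category_totals.values())
    return (p_bar - p_e) / (1 - p_e)

# Three experts rating five outputs on a 1-5 scale; a low kappa flags criteria that need discussion.
print(round(fleiss_kappa([[5, 5, 4], [1, 1, 1], [3, 3, 2], [4, 4, 4], [2, 2, 3]]), 2))
```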
Lesson Two: Break down vague criteria into specific judges. Instead of one judge checking whether an answer is "relevant, factual and comprehensive," create three separate judges, each targeting a specific quality aspect. This granularity matters because a failing "overall quality" score shows that something is wrong but not what.
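As a rough illustration of that decomposition, the snippet below replaces one vague rubric with three narrowly scoped ones; the rubric wording and the `score_fn` hook are hypothetical, not Databricks' criteria.

```python
# Sketch: splitting one vague criterion into narrowly scoped judges so a failing
# score points at a specific problem rather than "overall quality".
judges = {
    "relevance": "Does the answer address the user's actual question?",
    "factuality": "Is every claim supported by the retrieved documents?",
    "comprehensiveness": "Does the answer cover all parts of the question?",
}

def evaluate(question: str, answer: str, score_fn) -> dict[str, int]:
    # score_fn(rubric, question, answer) -> 1-5, e.g. a rubric-prompted LLM call.
    return {name: score_fn(rubric, question, answer) for name, rubric in judges.items()}
```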
The best results come from combining top-down requirements such as regulatory constraints and stakeholder preferences with bottom-up discovery of observed failure patterns. One customer built a top-down judge for accuracy but discovered through data analysis that correct responses consistently referenced at least two retrieval results. That insight became a new production-friendly judge that could proxy for accuracy without the need for ground-truth labels.
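A bottom-up judge like that can be a simple heuristic. The sketch below assumes citations appear inline as markers like `[doc_3]` and treats "at least two distinct citations" as the pass condition; both are assumptions for illustration, not the customer's actual rule.

```python
# Hedged sketch: a production-friendly proxy judge that needs no ground-truth labels.
import re

def retrieval_citation_judge(answer: str, min_citations: int = 2) -> bool:
    """Pass if the answer references at least `min_citations` distinct retrieved documents."""
    cited_docs = set(re.findall(r"\[doc_\d+\]", answer))
    return len(cited_docs) >= min_citations

print(retrieval_citation_judge("Rates rose in Q3 [doc_1] and fell in Q4 [doc_4]."))  # True
```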
Lesson Three: You need fewer examples than you think. Teams can build strong judges from just 20-30 well-chosen examples. The key is choosing edge cases that expose disagreement rather than obvious examples where everyone agrees.
"We’re able to run the process in as little as three hours with some teams, so it really doesn’t take that long to get a good judge," Coppola said.
Production results: Seven-figure deployments from pilots
Frankle said Databricks measures Judge Builder's success with three metrics: whether customers want to use it again, whether they increase their AI spending and whether they progress further along their AI journey.
On the first metric, one customer created more than a dozen judges after their initial workshop. "This customer made over a dozen judges after doing it rigorously for the first time with this framework," Frankle said. "They really went to town on the judges and are measuring everything now."
For the second metric, the business impact is clear. "There are many customers who have gone through this workshop and become seven-figure spenders on Genie at Databricks that they weren't before," Frankle said.
The third metric shows Judge Builder's strategic value. Customers who were previously hesitant to use advanced techniques like reinforcement learning now feel confident deploying them because they can measure whether improvements actually occurred.
"There are clients who have done very innovative work after these judges and have done very innovative work where they were reluctant to do so before." Frankel said. "They’ve been able to learn reinforcement with us by doing a little quick engineering. Why spend money on reinforcement learning, and why spend energy on reinforcement learning if you don’t know if it really made a difference?"
What should businesses do now?
Teams that successfully move AI from pilot to production treat judges not as one-time prototypes, but as assets that grow with their systems.
Databricks recommends three practical steps. First, focus on high-impact judges by identifying one observed failure mode in addition to one important regulatory requirement. This becomes your initial judge portfolio.
Second, build a lightweight workflow with subject matter experts. A few hours reviewing 20-30 edge cases provides sufficient calibration for most judges. Use batch annotation and inter-rater reliability checks to reduce noise in your data.
Third, schedule regular judge reviews using production data. New failure modes will emerge as your system evolves, and your judge portfolio should keep pace with them.
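One hedged way to operationalize that is to treat the judge portfolio itself as data: record where each judge came from and when it was last checked against experts, then flag stale ones for review. The field names, example values and 90-day cadence below are illustrative assumptions, not a Databricks schema.

```python
# Sketch of a judge portfolio as a living asset rather than a one-time prototype.
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class JudgeRecord:
    name: str
    source: str               # "regulatory requirement" or "observed failure mode"
    last_calibrated: date     # last check against expert ratings
    expert_agreement: float   # latest judge-vs-expert reliability score

portfolio = [
    JudgeRecord("tone", "observed failure mode", date(2025, 6, 1), 0.71),
    JudgeRecord("disclosure_present", "regulatory requirement", date(2025, 3, 15), 0.64),
]

# Flag judges that are due for review against fresh production data.
due = [j.name for j in portfolio if date.today() - j.last_calibrated > timedelta(days=90)]
print("schedule expert review for:", due)
```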
"A judge is a way to evaluate a model, it’s also a way to build a guardrail, it’s a way to have a metric against which you can quickly optimize and it’s also a way to have a metric against which you can do reinforcement learning," Frankel said. "Once you have a judge that you know represents your human taste in an empirical form that you can question as much as you want, you can use it in 10,000 different ways to measure or improve your agents."