
A new paper from Google and UC Santa Barbara researchers studies tool usage in large language model (LLM) agents and presents a framework that enables agents to make more efficient use of their tool and compute budgets. The researchers introduce two new techniques: a simple "Budget Tracker" and a more comprehensive framework called "Budget-Aware Test-time Scaling" (BATS). Both techniques explicitly inform agents of their remaining allowance for reasoning and tool use.
As AI agents increasingly rely on tool calls to perform real-world tasks, test-time scaling has become less about smarter models and more about controlling costs and delays.
For enterprise leaders and developers, budget-aware scaling techniques offer a practical way to deploy efficient AI agents without incurring unexpected costs or hitting diminishing returns on compute spend.
The challenge of scaling tool use
Traditional test-time scaling focuses on giving models more time to "think." However, for agentic tasks such as web browsing, the number of tool calls directly determines the depth and breadth of the search.
This introduces significant operational overhead for businesses. "Tool calls such as web page browsing result in higher token consumption, increase context length and introduce additional time delays," the paper's co-authors, Zhifeng Wang and Tengxiao Liu, told VentureBeat. "The tool call itself also introduces additional API cost."
The researchers found that providing agents with more test-time resources did not guarantee better performance. "In an intensive research workflow, if the agent has no sense of the budget, it often goes in blind," Wang and Liu explained. "It latches onto a somewhat relevant lead, then spends 10 or 20 tool calls digging into it, only to realize the whole path was a dead end."
Optimizing Resources with Budget Tracker
To see how far they could optimize the tool-use budget, the researchers first tried a lightweight approach, the "Budget Tracker." This module acts as a plugin that provides a constant signal of resource availability to the agent, enabling budget-aware tool use.
The team hypothesized that "providing clear budget signals enables the model to internalize resource constraints and adapt its strategy without the need for additional training."
The Budget Tracker works entirely at the prompt level, making it easy to implement. (The paper provides full details of the prompts used for the Budget Tracker.)
In Google's implementation, the tracker provides a brief policy manual outlining budget regimes and corresponding recommendations for tool use. At each step of the response process, the Budget Tracker explicitly informs the agent of its resource consumption and remaining budget, helping it condition subsequent reasoning steps on the latest resource state.
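To make the idea concrete, here is a minimal sketch of what a prompt-level budget tracker can look like. The policy text, thresholds, and function names below are illustrative assumptions, not the paper's actual prompts; `llm` stands in for any text-in/text-out model call.

```python
# Minimal sketch of a prompt-level budget tracker (illustrative, not the paper's exact prompts).

BUDGET_POLICY = (
    "You operate under a limited tool-call budget.\n"
    "- High budget remaining: explore several leads in breadth.\n"
    "- Medium budget remaining: focus on the single most promising lead.\n"
    "- Low budget remaining: stop exploring and consolidate your answer."
)

def budget_status(used: int, total: int) -> str:
    """Render the current resource state as a short prompt snippet."""
    remaining = total - used
    return (f"Budget update: {used}/{total} tool calls used, "
            f"{remaining} remaining ({remaining / total:.0%} of the budget left).")

def budget_aware_step(llm, history: list[str], used: int, total: int) -> str:
    """One reasoning step, conditioned on the latest budget signal."""
    prompt = "\n\n".join([BUDGET_POLICY, *history, budget_status(used, total)])
    return llm(prompt)  # `llm` is any completion function supplied by the caller
```

Because the tracker only appends text to the prompt, it requires no fine-tuning and can be bolted onto an existing agent loop.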
To test this, the researchers experimented with two paradigms: sequential scaling, where the model iteratively improves its output, and parallel scaling, where multiple independent runs are conducted and aggregated. They experimented with search agents equipped with search and browse tools following a ReAct loop. ReAct (reasoning + acting) is a popular method in which the model alternates between internal thinking and external actions. To track the true cost-efficiency scaling trend, they developed a unified cost metric that jointly accounts for both internal token consumption and external tool interactions.
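A unified cost metric of this kind can be as simple as converting tokens and tool calls into a single dollar figure. The sketch below shows one plausible form; the per-unit prices are placeholders, and the paper's exact weighting may differ.

```python
# Illustrative unified cost metric: internal token usage plus external tool-call costs.
# The per-unit prices are placeholder assumptions, not the paper's actual values.

def unified_cost(prompt_tokens: int, completion_tokens: int,
                 search_calls: int, browse_calls: int,
                 price_per_1k_prompt: float = 0.00125,
                 price_per_1k_completion: float = 0.01,
                 price_per_search: float = 0.005,
                 price_per_browse: float = 0.002) -> float:
    """Total cost of an agent run in dollars."""
    token_cost = (prompt_tokens / 1000) * price_per_1k_prompt \
               + (completion_tokens / 1000) * price_per_1k_completion
    tool_cost = search_calls * price_per_search + browse_calls * price_per_browse
    return token_cost + tool_cost
```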
They tested the Budget Tracker on three information-seeking QA datasets that require external search, including BrowseComp and HLE-Search, using models such as Gemini 2.5 Pro, Gemini 2.5 Flash, and Claude Sonnet 4. Experiments show that this simple plugin improves performance under various budget constraints.
"Adding Budget Tracker achieves comparable accuracy using 40.4% fewer search calls, 19.9% ​​fewer browse calls, and reducing overall cost… by 31.3%," The authors told VentureBeat. Finally, the budget tracker scale continued to increase as the budget increased, while simple reactions occurred after a certain threshold.
BATS: A comprehensive framework for budget-aware scaling
To further optimize tool-use resources, the researchers introduced Budget-Aware Test-time Scaling (BATS), which is designed to maximize an agent's performance under any given budget. BATS maintains a constant signal of remaining resources and uses this information to dynamically adapt the agent's behavior as it shapes its response.
BATS uses several modules to orchestrate agent actions. The planning module adjusts the effort of each stage to match the current budget, while the verification module decides whether to "dig deep" into a promising lead or "pivot" to alternative routes based on resource availability.
Given an information-seeking query and a tool-call budget, BATS begins by using the planning module to formulate a systematic action plan. When tools are invoked, their responses are added to the reasoning sequence as new contextual evidence. When the agent proposes a candidate answer, the verification module checks it and decides whether to continue with the current sequence or start a new attempt with the remaining budget.
The iterative process continues until the budget is exhausted, at which point an LLM-as-a-judge selects the best answer among all verified answers. During execution, the Budget Tracker continuously updates both resource usage and the remaining budget at each iteration.
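Put together, the control flow described above might look like the following sketch. The function names (`plan`, `react_step`, `verify`, `judge`) are hypothetical placeholders for the paper's modules, passed in as callables rather than taken from any released code.

```python
# Illustrative BATS-style control loop: plan, act with tools, verify candidates,
# and let an LLM judge pick the final answer once the budget is spent.
# All module functions are placeholder callables, not the paper's actual API.

def run_bats(query, budget, plan, react_step, verify, judge):
    """Run a budget-aware loop until the tool-call budget is exhausted."""
    verified, used = [], 0
    while used < budget:
        # Planning module: adapt the action plan to the remaining budget.
        plan_text = plan(query, budget - used)
        # ReAct-style rollout: returns a candidate answer and the tool calls it consumed.
        candidate, calls_used = react_step(query, plan_text, budget - used)
        used += max(calls_used, 1)  # Budget Tracker: update usage (always make progress)
        # Verification module: keep the answer, or start a fresh attempt with what's left.
        if candidate is not None and verify(query, candidate):
            verified.append(candidate)
    # Once the budget is exhausted, an LLM-as-a-judge picks the best verified answer.
    return judge(query, verified) if verified else None
```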
The researchers tested BATS on the BrowseComp, BrowseComp-ZH, and HLE-Search benchmarks against baselines including standard ReAct and various training-based agents. Their experiments show that BATS achieves higher performance with fewer tool calls and lower overall cost than competing methods. Using Gemini 2.5 Pro as the backbone, BATS achieved 24.6% accuracy on BrowseComp compared to 12.6% for standard ReAct, and 27.0% on HLE-Search compared to 20.5% for ReAct.
BATS not only improves effectiveness under budget constraints but also achieves better cost-performance trade-offs. For example, on the BrowseComp dataset, BATS achieved higher accuracy at a cost of about 23 cents, compared to a parallel-scaling baseline that required more than 50 cents to achieve similar results.
According to the authors, this efficiency makes previously expensive workflows feasible. "This opens up a range of long-horizon, data-driven enterprise applications… such as complex codebase maintenance, due diligence investigations, competitive landscape research, compliance audits, and multi-step document analysis," they said.
As enterprises look to deploy agents that manage their own resources, the ability to balance accuracy with cost will become a critical design requirement.
"We believe that the relationship between reasoning and economics will become inseparable," Wang and Liu said. "In the future, (models) should reason about value."