Google's AI can now surf the web for you, click on buttons, and fill out forms with Gemini 2.5 Computer Use

Some of the largest providers of large language models (LLMs) have tried to move beyond multimodal chatbots, extending their models into "agents" that can actually take actions on the user's behalf across websites. Recall OpenAI's ChatGPT Agent (formerly known as "Operator") and Anthropic's Computer Use, both released over the last two years.

Now, Google is joining the same game. Today, the search giant's DeepMind AI lab subsidiary unveiled a new, fine-tuned and custom-trained version of its powerful Gemini 2.5 Pro LLM, known as "Gemini 2.5 Pro Computer Use," which can use a virtual browser to surf the web, retrieve information, fill out forms, and even take actions on websites, all from a user's single text prompt.

"These are early days, but the model's ability to interact with the web – like scrolling, filling forms + navigating dropdowns – is an important next step in building general-purpose agents," said Google CEO Sundar Pichai, as part of a longer statement on the social network X.

However, the model is not available to consumers directly from Google.

Instead, Google partnered with another company, Browserbase, founded in early 2024 by former Twilio engineer Paul Klein, which offers virtual "headless" web browsers built specifically for use by AI agents and applications. (A "headless" browser is one that does not require a graphical user interface, or GUI, to navigate the web, though in this case and others, Browserbase does show a graphical representation to the user.)

Users can demo the new Gemini 2.5 Computer Use model directly on Browserbase and even compare it side by side with the older, rival offerings from OpenAI and Anthropic in a new "Browser Arena" launched by the startup (though only one additional model can be selected alongside Gemini at a time).

For AI builders and developers, it is being made available as a raw, albeit proprietary, LLM through the Gemini API in Google AI Studio for rapid prototyping, and through Google Cloud's Vertex AI model selector and application-building platform.

The new offering builds on the capabilities of Gemini 2.5 Pro, which was released in March 2025 but has been significantly updated since then, with a specific focus on enabling AI agents to interact directly with user interfaces, including browsers and mobile applications.

Overall, it appears Gemini 2.5 Computer Use is designed to let developers build agents that can complete interface-driven tasks such as clicking, typing, scrolling, filling out forms, and navigating behind login screens.

Rather than relying solely on APIs or structured inputs, this model allows AI systems to interact with software visually and functionally, much as a human would.

Brief hands-on user tests

In my brief, unscientific initial hands-on tests on the Browserbase website, Gemini 2.5 Computer Use successfully navigated to Taylor Swift's official website as instructed and gave me a summary of what was being sold or promoted at the top: a special edition of her newest album, "The Life of a Showgirl."

In another test, I asked Gemini 2.5 Computer Use to search Amazon for highly rated and well-reviewed solar lights I could put in my back yard, and I was glad to see it successfully complete a Google Search CAPTCHA designed to weed out non-human users ("Select all boxes with a motorcycle."). It did so in a matter of seconds.

However, once it got past that point, it stalled and was unable to complete the task, despite serving up a "task completed" message.

I should also note here that while OpenAI's ChatGPT Agent and Anthropic's Claude can create local files (such as PowerPoint presentations, spreadsheets, or text documents) on the user's behalf, Gemini 2.5 Computer Use does not currently offer direct file system access or native file creation.

Instead, it is designed to control and navigate web and mobile user interfaces through actions such as clicking, typing, and scrolling. Its output is limited to recommended UI actions or chatbot-style text responses. Any structured output such as a document or file must be handled separately by the developer, often through custom code or third-party integrations.

Performance benchmarks

Google says Gemini 2.5 Computer Use has demonstrated leading results on several interface control benchmarks, particularly when compared to other major AI systems, including Claude Sonnet and OpenAI's agent models.

The evaluations were conducted through Browserbase's and Google's own tests.

Some highlights include:

Online-Mind2Web (Browserbase): 65.7% for Gemini 2.5
WebVoyager (Browserbase): 79.9% for Gemini 2.5
AndroidWorld (DeepMind): 69.7% for Gemini 2.5; OpenAI's model could not be measured due to lack of access
OSWorld: Not currently supported by Gemini 2.5; the top competitor's result was 61.4%

In addition to strong accuracy, Google reports that the model operates at lower latency than other browser control solutions.

How it works

Agents powered by the Computer Use model operate inside an interaction loop. They receive:

A user task prompt
A screenshot of the interface
A history of past actions

The model analyzes this input and produces a recommended UI action, such as clicking on a button or typing in a field.

If needed, it can request confirmation from the end user for riskier tasks, such as making a purchase.

Once the action is executed, the interface is updated and a new screenshot is sent back to the model. The loop continues until the task is complete or is halted due to an error or a safety decision.
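
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Google's client code: `Action`, `run_agent`, and the `propose_action`, `execute`, `take_screenshot`, and `confirm` callables are hypothetical names standing in for the model call and the browser environment, and the `"done"` action name is an invented stop signal.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    """A recommended UI action (hypothetical shape, not the real API schema)."""
    name: str                          # e.g. "click_at"; "done" ends the loop in this sketch
    args: dict = field(default_factory=dict)
    needs_confirmation: bool = False   # set when the step is considered risky


def run_agent(
    task: str,
    propose_action: Callable[[str, bytes, list], Action],  # stands in for the model call
    execute: Callable[[Action], None],                      # applies the action to the UI
    take_screenshot: Callable[[], bytes],                   # captures the current interface
    confirm: Callable[[Action], bool],                      # asks the end user to approve
    max_steps: int = 20,
) -> list:
    """Run the prompt -> screenshot -> action loop until done, declined, or capped."""
    history: list = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        action = propose_action(task, screenshot, history)
        if action.name == "done":
            break  # the model reports the task is complete
        if action.needs_confirmation and not confirm(action):
            break  # the user declined a risky step, so stop
        execute(action)
        history.append(action)  # the history is fed back on the next turn
    return history
```

The step cap is a design choice added here so that a confused model cannot loop forever; the article does not say how real deployments bound the loop.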

The model uses a specialized tool called computer_use, and it can be integrated into custom environments using tools like Playwright or through the Browserbase demo sandbox.

Use cases and adoption

According to Google, teams both internal and external have already started using the model across several domains:

Google's payments platform team reports that Gemini 2.5 Computer Use successfully recovers more than 60% of failed test executions, reducing a major source of engineering inefficiency.
Autotab said the model outperformed others on complex data-parsing tasks, boosting performance on its hardest evaluations by 18%.
Poke.com, a proactive AI assistant provider, noted that the Gemini model often operates 50% faster than competing solutions during interface interactions.

The model is also being used in Google's own product development efforts, including Project Mariner, the Firebase Testing Agent, and AI Mode in Search.

Safety measures

Because this model directly controls software interfaces, Google emphasizes a multi-layered approach to safety:

A per-step safety service inspects every proposed action before execution.
Developers can define system-level instructions to block specific actions or require confirmation for them.
The model includes built-in safeguards to avoid actions that could compromise security or violate Google's prohibited use policies.

For example, if the model encounters a CAPTCHA, it will generate an action to click the checkbox but flag it as requiring user confirmation, ensuring the system does not proceed without human oversight.
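
A developer-side gate in the same spirit might look like the sketch below. It is an illustration only and says nothing about how Google's safety service is implemented: the `Decision` enum, the `review_action` function, and labels such as `"purchase"` or `"delete_account"` are all hypothetical.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    CONFIRM = "require_confirmation"
    BLOCK = "block"


def review_action(
    label: str,
    flagged_by_model: bool,
    blocked: frozenset,
    confirm_first: frozenset,
) -> Decision:
    """Decide what to do with one proposed action before it runs.

    `label` is a hypothetical developer-assigned category for the action;
    `flagged_by_model` mirrors the model marking a step as needing confirmation.
    """
    if label in blocked:
        return Decision.BLOCK      # developer rule: never run this
    if flagged_by_model or label in confirm_first:
        return Decision.CONFIRM    # pause and ask the end user
    return Decision.ALLOW


# Hypothetical policy: block account deletion, confirm purchases.
policy = dict(blocked=frozenset({"delete_account"}), confirm_first=frozenset({"purchase"}))
print(review_action("purchase", False, **policy))        # Decision.CONFIRM
print(review_action("solve_captcha", True, **policy))    # Decision.CONFIRM
print(review_action("scroll", False, **policy))          # Decision.ALLOW
```

Checking the block list first means a developer rule always wins over a confirmation prompt, which is one reasonable ordering among several.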

Technical capabilities

The model has built-in support for a wide array of UI actions, such as:

click_at, type_text_at, scroll_document, drag_and_drop, and more
User-defined functions can be added to extend its reach to mobile or custom environments
Screen coordinates are normalized (0-1000 scale) and translated back into pixel dimensions during execution
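
The coordinate translation in the last item is simple arithmetic. The sketch below assumes integer truncation and clamps to the last visible pixel; both choices are assumptions made here, and `denormalize` is a hypothetical helper name rather than part of any SDK.

```python
def denormalize(x_norm: int, y_norm: int, width: int, height: int) -> tuple[int, int]:
    """Map coordinates on the 0-1000 scale to pixel coordinates on a width x height screen."""
    if not (0 <= x_norm <= 1000 and 0 <= y_norm <= 1000):
        raise ValueError("normalized coordinates must lie in [0, 1000]")
    # Integer arithmetic avoids floating-point rounding surprises.
    x_px = min(x_norm * width // 1000, width - 1)    # clamp so 1000 stays on screen
    y_px = min(y_norm * height // 1000, height - 1)
    return x_px, y_px


print(denormalize(500, 500, 1440, 900))    # (720, 450)
print(denormalize(1000, 1000, 1440, 900))  # (1439, 899)
```

Because the model emits positions on a fixed scale, the same action can be replayed on a different screen size by changing only `width` and `height`.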

It accepts image and text input and outputs text responses or function calls to perform tasks. The recommended screen resolution for optimal results is 1440×900, although it can work with other sizes.
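
Since a model turn can be either text or a function call, client code typically branches on the response and dispatches calls by name. The following is a minimal sketch, assuming a plain dictionary response shape invented for illustration; the handler bodies, argument names, and the `open_app` function are hypothetical, and only the action names `click_at` and `type_text_at` come from the list above.

```python
from typing import Callable

# Registry of action handlers. Real handlers would drive a browser or device;
# these return strings so the sketch stays self-contained.
HANDLERS: dict[str, Callable[..., str]] = {
    "click_at": lambda x, y: f"click at ({x}, {y})",
    "type_text_at": lambda x, y, text: f"type {text!r} at ({x}, {y})",
}


def register(name: str, handler: Callable[..., str]) -> None:
    """Add a user-defined function, e.g. for a mobile or custom environment."""
    HANDLERS[name] = handler


def handle_response(response: dict) -> str:
    """Handle one model turn: run the function call if present, else return the text."""
    if "function_call" in response:
        call = response["function_call"]
        handler = HANDLERS.get(call["name"])
        if handler is None:
            raise KeyError(f"no handler registered for {call['name']}")
        return handler(**call.get("args", {}))
    return response.get("text", "")


# Hypothetical user-defined function for a mobile environment.
register("open_app", lambda app: f"open {app}")
print(handle_response({"function_call": {"name": "open_app", "args": {"app": "Maps"}}}))  # open Maps
print(handle_response({"text": "All done."}))                                             # All done.
```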

API pricing closely mirrors Gemini 2.5 Pro

Pricing for Gemini 2.5 Computer Use aligns closely with the standard Gemini 2.5 Pro model. Both follow the same per-token billing structure: input tokens cost $1.25 per million tokens for prompts under 200,000 tokens, and $2.50 per million tokens for prompts longer than that.

Output tokens follow a similar split, costing $10.00 per million for smaller responses and $15.00 per million for larger ones.
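
As a rough worked example of these rates, the sketch below estimates the cost of a single request. It assumes the higher output rate applies whenever the prompt exceeds 200,000 tokens, which the article does not spell out, and it treats a prompt of exactly 200,000 tokens as falling in the lower tier; `estimate_cost_usd` is a hypothetical helper, not an official calculator.

```python
PROMPT_TIER_THRESHOLD = 200_000  # tokens; boundary handling is an assumption


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in US dollars using the rates quoted above."""
    long_prompt = input_tokens > PROMPT_TIER_THRESHOLD
    input_rate = 2.50 if long_prompt else 1.25      # USD per million input tokens
    output_rate = 15.00 if long_prompt else 10.00   # USD per million output tokens
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


print(f"{estimate_cost_usd(100_000, 10_000):.3f}")  # 0.225
print(f"{estimate_cost_usd(300_000, 10_000):.3f}")  # 0.900
```

An agent loop sends a fresh screenshot and the action history on every step, so the per-task cost is the sum of many such requests rather than a single one.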

Where the models diverge is in availability and additional features.

Gemini 2.5 Pro includes a free tier that allows developers to use the model at no cost, with no explicit token cap published, though usage may be subject to rate limits or quota constraints depending on the platform (such as Google AI Studio).

This free access covers both input and output tokens. Once developers exceed their allotted quota or switch to the paid tier, standard per-token pricing applies.

In contrast, Gemini 2.5 Computer Use is available exclusively through the paid tier. No free access is currently offered for this model, and all usage incurs token-based charges from the start.

Feature-wise, Gemini 2.5 Pro supports optional capabilities such as context caching (starting at $0.31 per million tokens) and grounding with Google Search (free for up to 1,500 requests per day, then $35 per 1,000 additional requests). These are not currently available for Computer Use.

Another distinction is in data handling: output from the Computer Use model is not used to improve Google products in the paid tier, while free-tier usage of Gemini 2.5 Pro contributes to model improvement unless the user explicitly opts out.

Overall, developers can expect similar token-based costs across both models, but they should consider tier access, included capabilities, and data use policies when deciding which model fits their needs.
