
Gemini 2.5 ‘Computer Use’: Can This Model Automate Your Browser?

Imagine an AI that doesn’t just understand your words but can actually use your computer. An AI that sees your screen, understands your goal, and navigates complex websites to get things done for you. This is the promise of Google’s new Gemini 2.5 ‘Computer Use’ model, released in preview on October 7, 2025: a powerful new agent that’s stepping out of the chat box and into your browser.

But how does this technology actually work behind the scenes? What kinds of tasks can you realistically automate with it right now? And most importantly, what are the critical safety rails and limitations you need to understand before you start building? This article will answer those questions and give you a clear look at the future of web automation.

The Key Takeaways

  • Gemini 2.5 ‘Computer Use’ is a preview model that analyzes screenshots and returns step-by-step UI actions like clicks and typing for your code to execute.
  • It is designed specifically for browser automation. It does not support control over your operating system (OS).
  • Safety is a core feature. The model requires human confirmation for risky actions, making a “human-in-the-loop” approach essential.
  • While it shows strong performance in benchmarks, it’s still a preview, so perfect reliability isn’t guaranteed.

How It Works: The Screenshot-to-Action Loop

The core of the Computer Use model is a simple but powerful cycle called the “screenshot-to-action loop.” Here’s how it works:

  1. You provide a goal and a screenshot. You tell the model what you want to achieve and show it the current state of your web browser.
  2. The model returns actions. Instead of just a text reply, the model uses function calling for UI actions, giving your code specific commands like ‘click at’ or ‘type text at’. The model can also return multiple actions in one turn (parallel function calling) to reduce round-trips.
  2. Your code executes the actions. Your application, typically a Python program using a Playwright executor, performs the requested click or type command in a real browser.
  4. You send back the result. After the action, your code takes a new screenshot and sends it back to the model as a “function response,” which includes the new image and the current URL.

This “agent loop” repeats, allowing the model to see the results of its actions and decide on the next step until the task is complete. To simplify setup, you can use third-party resources like the Stagehand template. All actions use a normalized 1000×1000 grid; your client code is responsible for converting these coordinates to viewport pixels when executing the action.
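
To make the loop concrete, here is a minimal sketch of a Python executor built on Playwright. The model call is hidden behind a hypothetical request_next_actions() helper, and the action names and argument keys (click_at, type_text_at, x, y, text) are assumptions paraphrased from this article rather than the official schema.

```python
# A minimal sketch of the screenshot-to-action loop with a Playwright executor.
# The model call is a hypothetical placeholder; wire it up to the official SDK.
from playwright.sync_api import sync_playwright

VIEWPORT = {"width": 1280, "height": 800}

def denormalize(x: float, y: float) -> tuple[float, float]:
    """Convert the model's normalized 1000x1000 coordinates to viewport pixels."""
    return x / 1000 * VIEWPORT["width"], y / 1000 * VIEWPORT["height"]

def request_next_actions(goal: str, screenshot: bytes, url: str) -> list[dict]:
    """Placeholder for the real SDK call: send the goal, the latest screenshot,
    and the current URL, then return the model's proposed actions."""
    raise NotImplementedError("connect this to the Computer Use model")

def run_agent(goal: str, start_url: str, max_turns: int = 20) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(viewport=VIEWPORT)
        page.goto(start_url)
        for _ in range(max_turns):
            # Steps 1-2: send goal + current state, receive one or more actions.
            actions = request_next_actions(goal, page.screenshot(), page.url)
            if not actions:          # no actions returned: treat the task as done
                break
            # Step 3: execute each action in the real browser.
            for action in actions:
                if action["name"] == "click_at":
                    page.mouse.click(*denormalize(action["x"], action["y"]))
                elif action["name"] == "type_text_at":
                    page.mouse.click(*denormalize(action["x"], action["y"]))
                    page.keyboard.type(action["text"])
                # ... remaining predefined actions go here ...
            # Step 4: the loop repeats; the next iteration takes a fresh
            # screenshot and sends it back as the "function response".
        browser.close()
```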

What You Can Automate Today

The primary strength of this model is web UI action automation. It’s designed for tasks that require a sequence of clicks, text entries, and navigation across different web pages. A perfect practical example is automating data gathering or checkout flows. The model can intelligently handle form filling automation for sign-ups or applications.

For developers, this opens up new possibilities, such as QA smoke tests driven by LLM-based UI testing. This technology represents a more intelligent evolution of Robotic Process Automation (RPA), creating a new, LLM-powered category of RPA designed specifically for the dynamic environment of the web.

To accomplish these tasks, the model comes equipped with 13 predefined actions:

  • open web browser
  • navigate
  • search
  • go back
  • go forward
  • click at
  • hover at
  • type text at
  • key combination
  • scroll document
  • scroll at
  • drag and drop
  • wait 5 seconds
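
To make the toolkit concrete, here is a sketch of how these predefined actions might map onto Playwright calls. It reuses the denormalize() helper and the action-name assumptions from the loop sketch above; the official reference implementation defines the exact names and arguments.

```python
# A sketch of how the 13 predefined actions might map onto Playwright calls.
# Action names and argument keys are assumptions based on the list above.
from urllib.parse import quote_plus

def execute_action(page, action: dict) -> None:
    name, args = action["name"], action.get("args", {})
    if name == "open_web_browser":
        pass                                      # the browser is already running
    elif name == "navigate":
        page.goto(args["url"])
    elif name == "search":
        # One possible interpretation: run the query on a search engine.
        page.goto("https://www.google.com/search?q=" + quote_plus(args["query"]))
    elif name == "go_back":
        page.go_back()
    elif name == "go_forward":
        page.go_forward()
    elif name == "click_at":
        page.mouse.click(*denormalize(args["x"], args["y"]))
    elif name == "hover_at":
        page.mouse.move(*denormalize(args["x"], args["y"]))
    elif name == "type_text_at":
        page.mouse.click(*denormalize(args["x"], args["y"]))
        page.keyboard.type(args["text"])
    elif name == "key_combination":
        page.keyboard.press(args["keys"])         # e.g. "Control+A"
    elif name == "scroll_document":
        page.mouse.wheel(0, 800 if args.get("direction", "down") == "down" else -800)
    elif name == "scroll_at":
        page.mouse.move(*denormalize(args["x"], args["y"]))
        page.mouse.wheel(0, 800 if args.get("direction", "down") == "down" else -800)
    elif name == "drag_and_drop":
        page.mouse.move(*denormalize(args["x"], args["y"]))
        page.mouse.down()
        page.mouse.move(*denormalize(args["dest_x"], args["dest_y"]))
        page.mouse.up()
    elif name == "wait_5_seconds":
        page.wait_for_timeout(5000)
    else:
        raise ValueError(f"Unhandled action: {name}")
```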

This toolkit provides a powerful foundation for automating a wide variety of everyday browser-based workflows. However, while these capabilities are extensive, it is just as important to understand what the model cannot and should not do.

Know The Boundaries and Limitations

It’s crucial to understand that this technology is strictly browser-focused. Full OS-level control is not supported, so you can’t ask it to open desktop applications or manage files on your computer. You should think of it as an autonomous browser agent that is still in a preview phase.

This preview status means that web agent reliability is not yet perfect. The model can sometimes make mistakes. Because of this, it is essential to follow these rules:

  • Never attempt to automatically solve or bypass CAPTCHAs. Google’s official guidance states you must use a human-in-the-loop for such security checks and other sensitive actions.
  • Never bypass a ‘require confirmation’ safety gate. Always prompt the user for approval.
  • Always run in a sandboxed browser profile or VM. This isolates its activity from your personal data, passwords, and browsing history, providing an essential safety guardrail (a minimal sketch follows this list).
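
One minimal way to follow that last rule, assuming a local Playwright executor, is to launch the agent in a throwaway persistent context backed by a temporary directory:

```python
# Run the agent in an isolated, disposable browser profile so it never touches
# your real cookies, saved passwords, or browsing history.
import tempfile
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    with tempfile.TemporaryDirectory() as profile_dir:
        context = p.chromium.launch_persistent_context(
            profile_dir,                          # fresh, empty profile
            headless=True,
            viewport={"width": 1280, "height": 800},
        )
        page = context.new_page()
        # ... run the agent loop against `page` here ...
        context.close()
        # The temporary profile directory is deleted when this block exits.
```

Because the profile lives in a temporary directory, any cookies or credentials created during the run are discarded as soon as the context closes.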

These limitations are critical to understand, as they lead directly into the model’s most important design feature: its built-in safety system.

Safety First: Built-in Guardrails

Google has built important safety mechanisms directly into the API to prevent unintended actions. For potentially sensitive operations, like submitting a form with personal information, the model is designed to pause and ask for permission. This creates a “risk-action confirmation workflow” that ensures a human is always in control.

Here are the key safety features:

  • Human-in-the-Loop Confirmation: This is the core safety principle. The system is designed to stop and ask a real person for approval before it performs actions with significant consequences, such as sending messages or submitting payment details.
  • The ‘safety decision’ Flag: When the model detects a risky step, its API response will include a “safety decision” with the value “require confirmation.” Your application code must be written to recognize this signal and prompt the user.
  • Safety Acknowledgement: If the user approves the action, your next “function response” must include a “safety acknowledgement” payload.
  • Proactive Action Control: Developers can add another layer of security by providing a list of “excluded predefined functions.”
  • Web Defenses: The model has some built-in “prompt injection defenses” to help it resist being tricked by malicious instructions on a webpage.

Here is how to handle the safety gate: If the model’s response includes a safety decision of ‘require confirmation,’ your code must pause and show a UI to the user. If the user approves the action, your application must then include a ‘safety acknowledgement’ field set to ‘true’ in its next function response before proceeding. If the user denies it, your code should stop the loop.
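
Here is a small sketch of that workflow. The field names (safety_decision, require_confirmation, safety_acknowledgement) are paraphrased from the descriptions above rather than copied from the API reference, and execute_action and make_function_response stand in for your own executor and response-building code:

```python
def confirm_with_user(description: str) -> bool:
    """Block until a human explicitly approves or rejects the risky action."""
    answer = input(f"The agent wants to: {description}\nApprove? [y/N] ")
    return answer.strip().lower() == "y"

def handle_action(action: dict, execute_action, make_function_response):
    """Run one proposed action, enforcing the require-confirmation gate.
    Returns the function response to send back, or None to stop the loop."""
    if action.get("safety_decision") == "require_confirmation":
        if not confirm_with_user(action.get("description", action["name"])):
            return None                      # user declined: end the agent loop
        execute_action(action)
        # The next function response must carry the user's acknowledgement.
        return make_function_response(action, safety_acknowledgement="true")
    execute_action(action)
    return make_function_response(action)
```

If handle_action returns None, the outer agent loop should simply stop instead of sending anything back to the model.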

Ultimately, these features are fundamental to how the tool is meant to be used. Following the recommended Vertex AI safety best practices is essential.


Getting Started and Technical Details

To get started, you can get an API key from Google AI Studio, but note there isn’t a dedicated “Computer Use runner” in the Studio interface. Google recommends two main paths for hands-on testing:

  1. Local Reference Implementation: Use the official Python SDK (with Playwright) to build and run the agent loop on your own machine.
  2. Public Demo: For a no-setup trial, you can use the public Browserbase Gemini demo.

For more scalable, production-ready applications, Vertex AI is the recommended environment. During the preview, only the Python SDK is officially supported.
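
As a rough sketch of the local path, assuming the google-genai Python SDK, a first request might look like the snippet below. The Computer Use tool configuration shown is an approximation from Google’s documentation, so treat the official reference implementation as authoritative.

```python
# A rough sketch of the local setup path, assuming the google-genai Python SDK.
# One-time install:
#   pip install google-genai playwright
#   playwright install chromium
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # key created in Google AI Studio

# The tool configuration below is an assumption sketched from Google's docs.
config = types.GenerateContentConfig(
    tools=[types.Tool(
        computer_use=types.ComputerUse(
            environment=types.Environment.ENVIRONMENT_BROWSER,
            # Optional extra guardrail: block actions you never want the agent
            # to take (names here are illustrative).
            # excluded_predefined_functions=["drag_and_drop"],
        )
    )]
)

# First turn: just the goal. Later turns append "function responses" carrying
# a fresh screenshot and the current URL, as described earlier in this article.
response = client.models.generate_content(
    model="gemini-2.5-computer-use-preview-10-2025",
    contents="Open example.com and find the pricing page.",
    config=config,
)
print(response)
```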

The model’s impressive performance is driven by two key technological advancements:

  • Gemini 2.5 Pro visual reasoning: This is the underlying capability that allows the model to accurately interpret screenshots and identify UI elements.
  • Large context window: During the preview on Vertex AI, the Computer Use model shows limits of approximately 128k input tokens and 64k output tokens. For context, Gemini 2.5 Pro generally supports up to 1M input tokens, so it’s important to check your specific environment’s limits.

Technical Details at a Glance

  • Model ID: gemini-2.5-computer-use-preview-10-2025
  • Primary SDK: Python (with a Playwright executor)
  • Coordinate system: normalized 1000×1000 grid
  • Input format: goal (text) + optional screenshot; subsequent turns send a “function response” containing a new screenshot and the current URL
  • Output format: function calls (e.g., ‘click at’, ‘type text at’)
  • Pricing & limits: billed under the Gemini 2.5 Pro SKU; preview limits on Vertex AI are roughly 128k input / 64k output tokens

In summary, getting started involves using the official Python tools or the Browserbase demo.

How It Compares in The Real World

While Gemini 2.5 ‘Computer Use’ is still in preview, early benchmarks suggest it is a highly capable and competitive agent for web-based tasks. The model has been evaluated against other leading AI agents on standardized tests, often run in environments like the Browser Arena comparison. Results can vary based on the exact test setup and environment.

The Key Benchmarks

Scores below are for Gemini 2.5 Computer Use, with the source or test harness noted in parentheses:

  • Online-Mind2Web: 69.0% (Google’s official self-reported score)
  • Online-Mind2Web: ~65.7% (third-party Browserbase test harness)
  • WebVoyager: ~79.9% (third-party Browserbase test harness)
  • AndroidWorld: ~69.7% (Google’s internal test; the model is not optimized for mobile)

These numbers indicate that Google’s strategy of focusing narrowly on the browser may be paying off in terms of reliability. For those who want to see it in action, the public Browserbase Gemini demo provides a live playground to test its capabilities on various websites.

This real-world performance, combined with its benchmark scores, positions Gemini 2.5 ‘Computer Use’ as a powerful new tool for building the next generation of web automation.

Conclusion

Gemini 2.5 ‘Computer Use’ represents a significant step toward truly helpful AI agents that can take action on our behalf. It’s a powerful tool for web automation, offering a glimpse into a future where we can delegate complex digital tasks to an AI assistant.

However, its preview status and the absolute necessity of a human-in-the-loop for safety remind us that we are still in the early days. This isn’t a “set it and forget it” technology; it’s a collaborative tool that requires thoughtful implementation and constant oversight for critical actions. The built-in guardrails are not just features; they are the foundation for using this model responsibly.

As developers build with it, we will start to see just how capable these browser-based agents can become, moving us closer to a world where AI doesn’t just provide answers, but actively helps us get things done.
