Useful browser automation: why Operator isn't enough
How VLM browser automation fits in the software engineering paradigm
I’ve been spending the past few months diving into VLM browser use. It’s an exciting technology, and I’ve found myself pleasantly surprised by what it can do with very little input. Yet I’ve come to the conclusion that these surprises are the opposite of what we really want. Today, I am laying out an argument for a framework that needs to exist in browser automation: moving from capabilities towards engineering – systems built on robust software principles that developers can actually trust and integrate.
The goal of this post is simple: to figure out how we can make AI-powered browser automation reliable, predictable, and truly useful in production environments.
My belief is that the key to useful browser automation is (1) a constrained developer experience, which leads to (2) a composable, well-defined API, with the messiness of browsers taken out of the way by (3) scalable, managed browser infrastructure.
The current state of the art
A Quick Look Back: RPA and Web Scraping
Browser automation isn’t new technology, and is in fact a $100B industry that goes back to the beginnings of the internet. Prior to VLM systems, browser automation had two key use cases:
Robotic Process Automation (RPA): RPA is used to automate tasks within stable UIs, especially where 100% accuracy is paramount. Think repetitive back-office tasks like transferring data between spreadsheets and legacy systems, updating customer records, or submitting standardized forms across multiple platforms. Its strength lies in mimicking precise, rule-based human actions. However, RPA is (1) expensive, often requiring teams of implementation engineers to get right, and (2) often not worth the maintenance effort when UIs are dynamic and change frequently.
Web Scraping: On the other hand, web scraping has been invaluable for data collection at scale. Unlike RPA, which mutates state in order to automate workflows, web scraping is purely about extracting data. For many scraping use cases, achieving 80% data accuracy or coverage is often good enough to derive valuable insights, making it resilient to minor site changes but less suitable for transactional integrity.
How VLM systems are changing this paradigm
VLM systems like Operator are bridging the middle ground between these two. A few benefits they bring:
Runtime Decision Making: Unlike pre-programmed scripts, VLMs can make decisions on the fly. They can interpret content, understand context, and choose the next best action based on the current state of the browser, much like a human would.
Vision – Understanding Like a Human: VLMs allow automation tools to "see" and interpret web pages visually. They can identify elements not just by their DOM structure but by their appearance and context (e.g., "click the green 'Submit' button next to the summary"). This makes them more resilient to minor UI changes.
Providing Memory and Context: VLMs can maintain context across multiple steps and even sessions. They can "remember" previous actions, user preferences, or data gathered earlier in a workflow to inform current decisions, leading to more coherent and intelligent task completion.
Generating New Paths and Retrying: When faced with an unexpected error or a changed UI, VLMs can attempt to find alternative ways to achieve their goal or intelligently retry actions, rather than simply failing.
This has unlocked a big set of use cases that include:
Handling dynamic UIs: SaaS apps are changing all the time— the average frontend deployment frequency has gone from every month 10 years ago to daily or even multiple times daily today. VLM systems are tolerant to these changes.
Interacting with non-standardized (but similar) systems: In industries like insurance, healthcare, or government services, no two vendors use the same portal design or naming conventions. VLM-driven agents can visually interpret and complete forms even when field names vary (“Patient ID” vs. “Member Number”), layouts shift, or extra fields appear.
Customizing behavior based on user identity: Instead of a static automation script, VLM systems can be provided with user- or organization-specific context to behave differently. For example, ERP and CRM systems are notorious for extreme customization and unintelligible schema names, but a VLM system can intelligently guess the semantic model of a given ERP deployment and adapt accordingly.
Many have been building products on these use cases. However, past the prototype stage, the inevitable question arises: “How do we make this work reliably in production?”
Software engineering is dead, long live software engineering
Why is productionizing hard? Despite the impressive demos, VLM browser automation faces these challenges today:
Debugging UIs at scale is hard: User interfaces are inherently visual and stateful. When an automation breaks, understanding why can be incredibly difficult. Was it a timing issue? Did an element not load? Did the AI misinterpret something? Debugging often involves individually re-running, visually inspecting, and trying to recreate flaky states – this quickly becomes unscalable beyond a couple of hundred calls.
End-to-end workflows will always require components that live outside of the browser: Need to process emails, manipulate local files, or trigger backend workflows? Real-world automation frequently spans beyond the DOM, and handling those cross-context interactions reliably adds complexity that most VLM-driven agents aren’t designed to manage out of the box.
Costs matter in production: Running VLMs is not cheap. Some systems I’ve experimented with rack up dollars in costs for a short (<20 click) interaction. It’s fine for testing and prototyping, but running even thousands of these can add up.
Back to basics— software engineering fundamentals
The beauty of software engineering is that some fundamentals never change. With each wave of new technology across the stack, software engineers keep going back to the same design patterns— well-defined interfaces, decomposability, and compile-time guarantees. A few that are notable for me:
Dynamically typed languages —> statically typed languages. Python and Ruby were hugely popular dynamically typed languages that dominated web development in the 2000s and 2010s. But over time, Ruby lost momentum as teams gravitated toward statically typed alternatives like Go and Rust. Python, meanwhile, still exists in production systems largely by evolving to support optional static typing (via type hints, mypy, etc.), which became essential for scaling (a short illustration follows below).
NoSQL —> SQL. NoSQL remains a great fit for specific use cases and at a certain scale (e.g. prototyping features that might not exist tomorrow). But in practice, I’ve lived through too many painful migrations to SQL to ignore the value of a schema. Spending a few extra hours defining a data model up front often saves far more time than it costs (e.g. hundreds of developer hours and morale-crushing database operations down the line).
GraphQL —> REST. Again, GraphQL has its uses, but when it comes to preventing uncontrolled fanout, effective caching, and predictable performance, teams inevitably fall back to REST to scale systems. I’ve seen too many databases go down because of O(n^2) query explosion from “simple” GraphQL interfaces.
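To make the typing point concrete, here is a minimal sketch of the kind of guarantee optional static typing buys you. The names are invented for illustration, and the checker is assumed to be mypy or any other PEP 484-compatible tool.

```python
from dataclasses import dataclass


@dataclass
class CustomerRecord:
    customer_id: str
    balance_cents: int


def apply_credit(record: CustomerRecord, amount_cents: int) -> CustomerRecord:
    # The annotations are the interface: callers must pass a CustomerRecord
    # and an int, and they always get a CustomerRecord back.
    return CustomerRecord(record.customer_id, record.balance_cents + amount_cents)


# A type checker flags this before the code ever runs, because "50" is a str:
# apply_credit(CustomerRecord("c_123", 1_000), "50")
```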
One common theme amongst these examples is that there is a fundamental tradeoff between expressiveness and efficiency. Systems that can do it all will never be reliable, maintainable, or cheap over time.
What does this mean for browsers?
Ultimately, useful browser automation will have to make the same set of tradeoffs. That means embracing the following boring principles:
Decomposition and Defined Interfaces: Individual browser automation tasks should be designable as components that can be reliably combined to build more complex workflows. Each component should have clear inputs and outputs.
Interoperability with existing software systems: Instead of just "performing actions" on a UI, browser automation components should expose APIs that other systems or developer code can interact with predictably. This means clear contracts for what a piece of automation does, what data it expects, and what data it returns (a rough sketch of such a component follows below).
Failing readily and loudly: In most production systems, writing code to handle errors takes up about as much time as writing the happy path in the first place. Monolithic VLM systems don’t provide affordances for handling errors, and will happily carry on doing what they think is right without ever telling you. This is pretty bad: imagine taking real production data with sensitive information and inputting it into the wrong system because the VLM system mistook the “Contact us” form for the “File a medical claim” form.
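As a rough sketch of what those principles could look like in practice, here is one hypothetical component contract in Python. The names (FillClaimFormInput, FormFillError, and so on) are invented for illustration, not an existing library.

```python
from dataclasses import dataclass


@dataclass
class FillClaimFormInput:
    """Defined interface: exactly what this component needs as input."""
    portal_url: str
    member_number: str
    claim_amount_cents: int


@dataclass
class FillClaimFormResult:
    """Defined interface: exactly what the component returns on success."""
    confirmation_id: str
    screenshot_url: str


class FormFillError(Exception):
    """Raised loudly when the task cannot be completed as specified,
    instead of letting the agent improvise on the wrong form."""


def fill_claim_form(task: FillClaimFormInput) -> FillClaimFormResult:
    """One decomposed unit of browser automation.

    The VLM-driven browser session lives behind this boundary; callers never
    see clicks or selectors, only the typed contract above, and any ambiguity
    surfaces as a FormFillError rather than a silent best guess.
    """
    raise NotImplementedError("wire this up to your browser automation backend")
```

The point is not this particular shape, but that a caller can compose, test, and monitor such a component like any other function in their codebase.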
I believe the path forward is an approach that treats AI-driven browser interaction not as a monolithic "operator," but as a powerful capability to be wrapped in well-defined, engineered components.
A sketch of an ideal experience
As a developer who wants to build functionality using browser automation, I want to:
Show a system the UI steps I need to complete a task
Have the system automatically generate REST API(s) to achieve this
e.g. form-filling is always a POST request
Deploy the service automatically
Call, monitor, and integrate this like any other managed service
That’s it.
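To make that concrete, here is roughly what calling such a generated service could look like from developer code. The endpoint, payload fields, and response shape are entirely hypothetical.

```python
import requests  # third-party HTTP client, assumed to be installed

# Hypothetical endpoint generated from a recorded "file a medical claim" UI flow.
response = requests.post(
    "https://api.example-automation.dev/v1/workflows/file-medical-claim",
    json={
        "member_number": "A-12345",
        "claim_amount_cents": 12_500,
        "notes": "Annual checkup",
    },
    timeout=120,
)

# Failures surface as ordinary HTTP errors, not silent mis-clicks in a browser.
response.raise_for_status()
print(response.json()["confirmation_id"])
```

Everything downstream (retries, monitoring, cost controls) then looks like standard service operations rather than agent babysitting.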
Common objections:
Won’t Operator just get good enough to do this?
Yes, but OpenAI has more things to do (like build a new category of consumer hardware) than to maintain a set of rotating proxies so browsers don’t get blocked by Cloudflare.
Won’t systems just build APIs?
Actually, no. API <> UI parity is incredibly difficult (maybe impossible?). They ultimately are designed for different use cases, and UI has primacy in 99% of SaaS software. This is not to mention the many legacy systems that will never prioritize APIs (I know, I used to work in government).
Won’t engineering teams just build this system themselves?
I think the research needed to translate multi-step browser/UI actions into well-defined APIs at the 99.5% accuracy these systems require isn’t trivial. Plus, launching, maintaining, and scaling services is always painful.
Won’t X/Y/Z framework/company just solve this?
I would be more than happy to be proven wrong! Please let me know about it and I can move on with my life.
Call To Action
If you’ve managed to read till the end, it means you’re pretty invested in making this work. I’m currently building a system based on these principles, focusing on making AI-driven browser automation a truly developer-first experience.
If you're an engineer or researcher who’s interested in:
The frontiers of VLM research and its practical application.
The challenges of scaling browser infrastructure for complex AI interactions.
Crafting genuinely useful and robust developer experiences for browser automation.
...then I'd love to connect and hear your thoughts.
Consider what primitives you would need.
Critique the assumptions made here.
Share your real-world use cases and pain points.
You can reach me at yangzi@yzdong.me. Let’s figure out how to build browser automation that we can all depend on.