Agents · Tool Use

Browser Use

LLM agents that control web browsers to navigate pages, fill forms, and extract data, turning any website into an automated tool

Core Loop: Observe → Act
Engine: Playwright
Coverage: Any Website

SECTION 01

What Is Browser Use?

Browser Use is the capability of an LLM agent to control a real web browser (Chromium, Firefox, or WebKit) through a programmatic automation layer, typically Playwright or Selenium. The agent perceives the current page state, decides what action to take next (click, type, scroll, navigate), executes that action, and observes the resulting new page state. This loop repeats until the goal is achieved.

The key value proposition: any website becomes a tool without requiring an official API. Browser agents can research competitors, fill out government forms, monitor prices, extract structured data from dynamic JavaScript-rendered pages, automate repetitive web workflows, and interact with internal web tools that lack programmatic interfaces.

Use cases:

Browser Use vs web scraping: Traditional scrapers use fixed CSS selectors or XPath patterns that break when the page changes. Browser agents understand the page semantically ("find the Add to Cart button") and adapt to layout changes without code updates.
SECTION 02

How It Works

The browser agent loop alternates between observation (what is on the page right now?) and action (what should I do next?).

Goal: "Go to Hacker News and extract the top 5 story titles"
       │
┌──────▼────────────────────────────────┐
│  Browser Agent (LLM)                  │
│                                       │
│  Observe: screenshot / DOM of page    │
│  Reason:  where are the story titles? │
│  Act:     extract_content(selector)   │
└──────┬────────────────────────────────┘
       │ action
┌──────▼─────────────────────────┐
│  Browser (Playwright)          │
│    navigate(url)               │
│    click(element)              │
│    type(element, text)         │
│    scroll(direction)           │
│    extract_content(selector)   │
└──────┬─────────────────────────┘
       │ new page state
       └──► back to agent (observe)

The loop terminates when the agent's reasoning concludes the goal is achieved, a maximum step count is reached, or an error cannot be recovered from.
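
The three termination conditions can be sketched as a small driver loop. This is an illustration only: `observe`, `decide`, and `act` are hypothetical callables standing in for the page parser, the LLM call, and the browser executor, and `Action` and `LoopResult` are names invented for this example rather than part of any library.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    name: str           # e.g. "click", "navigate", "done" (assumed names)
    argument: str = ""  # selector, URL, or the final result for "done"


@dataclass
class LoopResult:
    status: str            # "done", "max_steps", or "error"
    result: Optional[str]  # final answer, or the error message
    steps: int


def run_loop(observe: Callable[[], str],
             decide: Callable[[str], Action],
             act: Callable[[Action], None],
             max_steps: int = 20) -> LoopResult:
    for step in range(1, max_steps + 1):
        state = observe()        # what is on the page right now?
        action = decide(state)   # what should happen next?
        if action.name == "done":
            return LoopResult("done", action.argument, step)
        try:
            act(action)
        except RuntimeError as exc:
            # Treated as unrecoverable in this sketch
            return LoopResult("error", str(exc), step)
    return LoopResult("max_steps", None, max_steps)
```

Returning a status alongside the result lets the caller tell a finished task apart from one that merely ran out of steps.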

browser-use library: The open-source browser-use Python library (github.com/browser-use/browser-use) provides a high-level agent interface on top of Playwright. It handles the observe-act loop, page parsing, and LLM integration. You provide the goal; it handles the rest.
SECTION 03

Page Representation

How you represent the page to the LLM is the most important engineering decision in a browser agent. The LLM cannot see the browser directly; it receives a processed view of the page state. Three main representations exist, each with tradeoffs.

1. Screenshot (vision): Send a screenshot of the current browser viewport to a multimodal LLM. The model reasons visually about what to click. Pros: works on any page including canvas-heavy UIs. Cons: expensive per step, coordinates can be imprecise, visual context may miss hidden elements.

2. Accessibility tree (a11y tree): Parse the browser's accessibility tree, a structured representation of interactive elements (buttons, links, inputs) with their labels and roles. Much more compact than a screenshot and easier for the LLM to reason about. Most production browser agents use this.

3. Simplified DOM: Strip the raw HTML to just essential elements (links, buttons, inputs, headings) and their text. Sits between screenshot and full DOM in terms of information density.

# Accessibility tree representation (simplified):
[1] Link "Y Combinator" (href="/news")
[2] Link "new" (href="/newest")
[3] Link "past" (href="/front?day=2026-03-30")
[4] Span "1."
[5] Link "Show HN: An open-source coding agent" (href="item?id=...")
[6] Span "312 points by user42 | 147 comments"
...
# Agent reasoning: "I can see story [5] with link text. I'll extract [5] through [50]."
# Agent action: extract_content("[4]-[50]")
Tradeoff summary: Use screenshots for visual-heavy pages (maps, charts, canvas apps). Use accessibility trees for standard web pages. Never send the full raw DOM: even a simple page can be 500KB of HTML that overwhelms the context window.
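
To make the simplified-DOM idea concrete, here is a rough sketch that reduces raw HTML to a numbered list of interactive elements using only the standard library. It is not how any particular agent framework builds its view; the `InteractiveElements` class, the `render` function, and the role names are assumptions for illustration, and nested interactive elements are not handled.

```python
from html.parser import HTMLParser

INTERACTIVE = {"a", "button", "input", "select", "textarea"}


class InteractiveElements(HTMLParser):
    """Collects (tag, label, attrs) for interactive elements only."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._open = None  # (tag, attrs, text_parts) while inside an element

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE:
            return
        attrs = dict(attrs)
        if tag == "input":  # void element: no inner text to wait for
            label = (attrs.get("aria-label") or attrs.get("placeholder")
                     or attrs.get("name") or "")
            self.items.append((tag, label, attrs))
        else:
            self._open = (tag, attrs, [])

    def handle_data(self, data):
        if self._open is not None:
            self._open[2].append(data)

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open[0]:
            _, attrs, parts = self._open
            # Collapse whitespace; fall back to aria-label for icon-only elements
            label = " ".join("".join(parts).split()) or (attrs.get("aria-label") or "")
            self.items.append((tag, label, attrs))
            self._open = None


def render(html: str) -> str:
    parser = InteractiveElements()
    parser.feed(html)
    lines = []
    for i, (tag, label, attrs) in enumerate(parser.items, start=1):
        role = {"a": "Link", "button": "Button"}.get(tag, "Input")
        extra = f' (href="{attrs["href"]}")' if tag == "a" and attrs.get("href") else ""
        lines.append(f'[{i}] {role} "{label}"{extra}')
    return "\n".join(lines)
```

The numeric indices give the LLM a short, stable handle for each element, so an action can refer to `[2]` instead of a brittle selector.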
SECTION 04

Action Space

The action space defines what the agent can do. A well-designed action space is expressive enough to handle most web tasks but constrained enough that the agent uses actions correctly.

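Standard browser actions: navigate(url), click(element), type(element, text), scroll(direction), and extract_content(selector).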

Stopping action: Always include a done(result) action. Without it, the agent has no way to signal completion and may loop indefinitely checking if it's done.
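
One way to keep the action space constrained is to validate every action the model emits before it reaches the browser. The sketch below assumes the model replies with JSON of the form `{"action": ..., "args": {...}}`; that wire format, the `ACTION_SCHEMA` table, and `parse_action` are assumptions made for this example, with action names taken from the loop diagram plus `done(result)`.

```python
import json

# Action name -> required argument names (assumed schema for illustration)
ACTION_SCHEMA = {
    "navigate": ["url"],
    "click": ["element"],
    "type": ["element", "text"],
    "scroll": ["direction"],
    "extract_content": ["selector"],
    "done": ["result"],
}


def parse_action(raw: str) -> dict:
    """Parse and validate one model-emitted action; raise ValueError if invalid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("action must be a JSON object")

    name = data.get("action")
    if not isinstance(name, str) or name not in ACTION_SCHEMA:
        raise ValueError(f"unknown action: {name!r}")

    args = data.get("args", {})
    if not isinstance(args, dict):
        raise ValueError("args must be a JSON object")
    missing = [a for a in ACTION_SCHEMA[name] if a not in args]
    if missing:
        raise ValueError(f"{name} is missing arguments: {missing}")

    # Drop any arguments the schema does not know about
    return {"action": name, "args": {k: args[k] for k in ACTION_SCHEMA[name]}}
```

A rejected action can be fed back to the model as an error message instead of being executed, which keeps malformed output from ever touching the page.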
SECTION 05

Quickstart Example

Using the browser-use library for a simple research task:

# pip install browser-use playwright langchain-anthropic
# playwright install chromium
from browser_use import Agent
from langchain_anthropic import ChatAnthropic
import asyncio

async def main():
    llm = ChatAnthropic(model="claude-opus-4-5")
    agent = Agent(
        task="Go to https://news.ycombinator.com and return "
             "the titles and URLs of the top 5 stories.",
        llm=llm,
        # Optional: use_vision=True for screenshot-based observation
    )
    result = await agent.run(max_steps=20)
    print(result.final_result())

asyncio.run(main())

For custom browser control with direct Playwright:

import asyncio

import anthropic
from playwright.async_api import async_playwright

# Async client, so the API call does not block the event loop
client = anthropic.AsyncAnthropic()

async def get_page_state(page) -> str:
    # Simplified: page title plus the first 20 links
    title = await page.title()
    links = await page.evaluate(
        "() => Array.from(document.querySelectorAll('a')).slice(0, 20)"
        ".map(a => ({text: a.textContent.trim(), href: a.href}))"
    )
    return f"Page: {title}\nLinks: {links}"

async def browse(goal: str):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")
        for _ in range(10):
            state = await get_page_state(page)
            resp = await client.messages.create(
                model="claude-opus-4-5",
                max_tokens=500,
                messages=[{
                    "role": "user",
                    "content": f"Goal: {goal}\n\nPage state:\n{state}\n\nWhat is the result?"
                }]
            )
            print(resp.content[0].text)
            break  # single pass for demo
        await browser.close()  # release the browser process

asyncio.run(browse("List the top 3 Hacker News stories"))
SECTION 06

Reliability Patterns

Browser agents frequently encounter failure modes that don't exist in API-based tools. These patterns significantly improve reliability in production.

Step limiting and loop detection: Set a hard cap on steps (typically 20–50 for most tasks). Track which pages the agent has visited and actions it has taken. If the agent repeats the same action 3 times without progress, escalate to human review.
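
A simple way to detect the "same action 3 times" case is to keep a short window of recent steps. In this sketch, "without progress" is approximated as "same URL and same action", which is an assumption; `LoopDetector` is a hypothetical helper, not part of browser-use.

```python
from collections import deque


class LoopDetector:
    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        # Only the last max_repeats steps matter
        self._recent = deque(maxlen=max_repeats)

    def record(self, url: str, action: str) -> bool:
        """Record a step; return True if the last max_repeats steps are identical."""
        self._recent.append((url, action))
        return (len(self._recent) == self.max_repeats
                and len(set(self._recent)) == 1)
```

When `record` returns True, the caller would stop the run and hand the task to a human reviewer rather than spend more steps.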

Error recovery: When an action fails (element not found, page load timeout, CAPTCHA), give the agent explicit recovery options: try an alternative selector, reload the page, or navigate to the home page and restart. Don't silently retry the same failed action.
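
The recovery options can be surfaced to the agent as an explicit error observation. The following is a minimal sketch; the option names, the message wording, and the `recovery_message` helper are all assumptions for illustration.

```python
# Hypothetical option names mapped to the recovery paths described above
RECOVERY_OPTIONS = {
    "alternative_selector": "try a different selector for the same element",
    "reload": "reload the current page",
    "restart": "navigate to the home page and start over",
}


def recovery_message(failed_action: str, error: str, tried: set) -> str:
    """Build the observation sent back to the LLM after a failed action."""
    remaining = [name for name in RECOVERY_OPTIONS if name not in tried]
    lines = [f"Action failed: {failed_action}", f"Error: {error}"]
    if remaining:
        lines.append("Do not repeat the failed action. Choose one recovery option:")
        lines += [f"- {name}: {RECOVERY_OPTIONS[name]}" for name in remaining]
    else:
        lines.append("All recovery options were tried. Stop and report the failure.")
    return "\n".join(lines)
```

Tracking which options were already tried keeps the agent from cycling between the same two recoveries.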

Goal verification: After the agent signals completion, run a verification pass: did the agent actually achieve the goal? For data extraction, validate that the output matches the expected format and isn't empty. For form submission, verify the confirmation page loaded.

Structured output: Ask the agent to return results in structured JSON rather than natural language. This makes downstream processing reliable and errors obvious.
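
Goal verification and structured output work well together: once the result is JSON, the verification pass is ordinary validation code. This sketch assumes the agent was asked for a JSON array of objects with `title` and `url` keys; that schema and the `verify_stories` name are illustrative choices, not a fixed format.

```python
import json


def verify_stories(raw: str, expected_count: int) -> list:
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, list) or not data:
        return ["expected a non-empty JSON array"]

    problems = []
    if len(data) != expected_count:
        problems.append(f"expected {expected_count} items, got {len(data)}")
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            problems.append(f"item {i} is not an object")
            continue
        title, url = item.get("title"), item.get("url")
        if not isinstance(title, str) or not title.strip():
            problems.append(f"item {i} has an empty or missing title")
        if not isinstance(url, str) or not url.startswith(("http://", "https://")):
            problems.append(f"item {i} has an invalid url")
    return problems
```

Returning a list of problems rather than a boolean makes it easy to pass the failures back to the agent for a retry.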

Anti-bot detection: Most production websites use bot detection (Cloudflare, reCAPTCHA, behavioral fingerprinting). Browser agents are frequently blocked on high-traffic commercial sites. Ensure compliance with the site's terms of service and robots.txt before deploying. CAPTCHAs require human intervention; never attempt to bypass them programmatically.
SECTION 07

Production & Safety

Deploying browser agents in production requires careful attention to security, legal compliance, and operational reliability.

Security: Run browsers in isolated Docker containers with no access to the host filesystem or internal network. Use dedicated browser profiles with no saved passwords or payment methods. Never let an agent access URLs or sites outside a predefined allowlist for sensitive workflows.
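
An allowlist check is most useful when it runs in the executor, before every navigation, rather than in the prompt. A minimal sketch, assuming exact host matching and HTTPS only; the `ALLOWED_HOSTS` contents and the `is_allowed` name are placeholders.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist for a single-site workflow
ALLOWED_HOSTS = frozenset({"news.ycombinator.com"})


def is_allowed(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Allow only https URLs whose host exactly matches an allowlisted host."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL
    if parts.scheme != "https":
        return False  # rejects http:, file:, javascript:, and so on
    host = (parts.hostname or "").lower()
    return host in allowed_hosts
```

Matching the parsed hostname exactly, instead of checking whether the URL string contains the domain, avoids lookalike hosts such as `news.ycombinator.com.evil.example`.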

Credential safety: Never inject credentials into the agent's context window. Use a separate credential injection step after the agent has navigated to the login form, so the LLM never sees or handles passwords. Consider a human-in-the-loop step for any action involving financial transactions or account changes.
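
One possible shape for the injection step is placeholder substitution: the model only ever emits a token such as `{{LOGIN_PASSWORD}}`, and the executor swaps in the real value just before typing. The placeholder syntax and the `inject` and `redact` helpers below are assumptions for illustration; in practice the secrets would come from a secret store rather than a literal dict.

```python
import re

PLACEHOLDER = re.compile(r"\{\{([A-Z0-9_]+)\}\}")


def inject(text: str, secrets: dict) -> str:
    """Replace {{NAME}} placeholders with secret values at execution time."""
    def repl(match):
        name = match.group(1)
        if name not in secrets:
            raise KeyError(f"no secret named {name}")
        return secrets[name]
    return PLACEHOLDER.sub(repl, text)


def redact(text: str, secrets: dict) -> str:
    """Mask secret values in anything that flows back to the LLM or to logs."""
    for name, value in secrets.items():
        if value:
            text = text.replace(value, "{{" + name + "}}")
    return text
```

Running `redact` over page state and logs covers the case where a site echoes the typed value back into the page.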

Legal compliance: Respect robots.txt and terms of service. Many sites explicitly prohibit automated access. Competitive scraping of certain data (financial, personal) may have legal implications. Always consult legal before deploying agents that interact with third-party sites at scale.
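
The robots.txt part of this can be checked mechanically with Python's standard `urllib.robotparser`. The sketch below parses robots.txt text that has already been fetched; the `allowed_by_robots` wrapper and the user agent string in the usage example are hypothetical. Passing this check does not by itself establish terms-of-service compliance.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Usage with an inline robots.txt body
rules = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots(rules, "example-browser-agent", "https://example.com/news"))
print(allowed_by_robots(rules, "example-browser-agent", "https://example.com/private/x"))
```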

Monitoring: Log every agent step, the page state observed, and the action taken. Set up alerts for high error rates, unusual action patterns, and unexpected navigation to sensitive domains. Implement rate limits to prevent inadvertent denial-of-service on target sites.
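
Step logging and error-rate alerting can start out very small. This sketch keeps one JSON record per step and computes the error rate over a sliding window; the `StepLog` class, the window size, and the alert threshold are assumed values, not recommendations from any framework.

```python
import json
from collections import deque


class StepLog:
    def __init__(self, window: int = 20, alert_threshold: float = 0.5):
        self.records = []
        self.alert_threshold = alert_threshold
        self._recent_ok = deque(maxlen=window)  # success flags for recent steps

    def record(self, step: int, url: str, action: str,
               ok: bool, error: str = None) -> str:
        """Store one step and return it as a JSON line for a log sink."""
        entry = {"step": step, "url": url, "action": action,
                 "ok": ok, "error": error}
        self.records.append(entry)
        self._recent_ok.append(ok)
        return json.dumps(entry)

    def error_rate(self) -> float:
        if not self._recent_ok:
            return 0.0
        return self._recent_ok.count(False) / len(self._recent_ok)

    def should_alert(self) -> bool:
        return self.error_rate() >= self.alert_threshold
```

Page state is left out of the record here to keep the example short; a fuller version would log the observed state (redacted) as well.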

Starting point: Before building a custom browser agent, check if the target site has a public API. APIs are faster, cheaper, more reliable, and legally clearer than browser automation. Use browser agents specifically for cases where no API exists.