Architecture

How BabelWrap translates natural language into browser interactions and returns structured snapshots.

System Overview

BabelWrap sits between your AI agent and the target website. It provides two interfaces (REST API and MCP server) that both feed into the same session manager, which coordinates the Playwright browser engine and the LLM-powered snapshot proxy.

Agent: Claude / GPT / LangChain / custom
MCP Server: 16 tools, stdio transport
REST API: FastAPI backend
Session Manager: Cookies, auth, state, TTL
BabelWrap Proxy: LLM snapshot generation
Playwright Engine: Headless browser interaction
Target Website
Site Mapper: Explore, Validate, Generate
Public Catalog: PostgreSQL + in-memory cache

Components

BabelWrap Proxy (Snapshot Renderer)

Converts any webpage's DOM into a structured, LLM-readable snapshot. It auto-scrolls to trigger lazy-loaded content, waits for loading indicators to disappear, and then extracts all inputs, buttons, links, forms, tables, alerts, and navigation elements into a clean JSON structure. The snapshot is designed so an LLM can immediately understand the page without parsing HTML.

Playwright Engine

Executes real browser interactions headlessly using Chromium. Handles navigation, clicks, form fills, file uploads, keyboard input, and screenshots. Sessions draw on a shared browser pool, so each one stays lightweight and does not need its own full Chromium instance.

Session Manager

Maintains isolated browser contexts, each with its own cookies, localStorage, and page state. Sessions persist across actions and expire automatically after 1 hour of inactivity. Cookie injection and extraction enable authentication persistence across sessions.
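
The inactivity expiry described above can be sketched as follows. This is a minimal illustration, not BabelWrap's implementation: the Session and SessionStore names are hypothetical, an in-memory dict stands in for the real state store, and the browser context itself is omitted.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

SESSION_TTL_SECONDS = 60 * 60  # 1 hour of inactivity, as stated above


@dataclass
class Session:
    """Hypothetical per-session state; the real browser context is omitted."""
    session_id: str
    cookies: dict = field(default_factory=dict)
    last_active: float = field(default_factory=time.monotonic)

    def touch(self, now: Optional[float] = None) -> None:
        # Record activity so the inactivity timer restarts.
        self.last_active = time.monotonic() if now is None else now

    def is_expired(self, now: Optional[float] = None) -> bool:
        current = time.monotonic() if now is None else now
        return current - self.last_active > SESSION_TTL_SECONDS


class SessionStore:
    """In-memory stand-in for the session manager's state store."""

    def __init__(self) -> None:
        self._sessions: dict = {}

    def create(self, session_id: str, now: Optional[float] = None) -> Session:
        session = Session(session_id)
        session.touch(now)
        self._sessions[session_id] = session
        return session

    def get(self, session_id: str, now: Optional[float] = None) -> Optional[Session]:
        session = self._sessions.get(session_id)
        if session is None:
            return None
        if session.is_expired(now):
            # Expired sessions are dropped lazily, on the next lookup.
            del self._sessions[session_id]
            return None
        session.touch(now)
        return session
```

The optional now argument exists only so the expiry logic can be exercised without waiting an hour; every successful get counts as activity and resets the timer.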

Element Resolver

Maps natural language descriptions to DOM elements. This is BabelWrap's core innovation, and it uses a three-tier approach:

  1. Direct match — exact ID or label match (fastest, no LLM call)
  2. Cache hit — Redis-cached mapping from (snapshot_hash, target) pair
  3. LLM resolution — Claude Haiku matches the description to the most likely element, with confidence scoring and ambiguity detection

Caching means repeated interactions with the same page structure avoid LLM calls entirely, keeping costs low and latency minimal.
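
A rough sketch of the three tiers, under several assumptions: the Element type and function names are hypothetical, a plain dict stands in for Redis, and llm_match is a placeholder callable for the Claude Haiku call (confidence scoring and ambiguity detection are left out).

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Element:
    """Hypothetical minimal view of one interactive element in a snapshot."""
    id: str
    label: str


def snapshot_hash(elements: list) -> str:
    """Hash the page structure so identical layouts share cache entries."""
    joined = "|".join(f"{e.id}:{e.label}" for e in elements)
    return hashlib.sha256(joined.encode()).hexdigest()


def resolve(
    target: str,
    elements: list,
    cache: dict,
    llm_match: Callable[[str, list], Optional[str]],
) -> tuple:
    """Return (element_id, tier) for a natural language target."""
    wanted = target.strip().lower()

    # Tier 1: direct match on exact ID or label, no LLM call.
    for e in elements:
        if wanted in (e.id.lower(), e.label.lower()):
            return e.id, "direct"

    # Tier 2: cached mapping keyed by the (snapshot_hash, target) pair.
    key = (snapshot_hash(elements), wanted)
    if key in cache:
        return cache[key], "cache"

    # Tier 3: fall back to the LLM, then remember the answer.
    element_id = llm_match(target, elements)
    if element_id is not None:
        cache[key] = element_id
    return element_id, "llm"
```

With a page containing a "Sign In" button, resolve("Sign In", ...) returns on the direct tier, the first resolve("the Sign In button", ...) reaches the LLM tier, and a repeat of that call on the same page structure is served from the cache.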

Site Mapper Pipeline

Orchestrates the full site mapping flow: an AI explorer agent discovers the site structure, a validator tests the generated recipes, a tool generator converts them into typed MCP tools, and the results are persisted to PostgreSQL. On every server restart, all ready site models are reloaded and their tools re-registered on the MCP server.

Explorer Agent

Uses the agno framework with Claude Sonnet 4 to intelligently browse a target website. The agent has 11 tools: 6 BabelWrap browsing tools (create session, navigate, click, fill, extract, read page) and 5 reporting tools (report page type, report entity, report recipe, trace flow, mark complete). It visits ~10-20 pages to discover entities, page types, and action flows, then records step-by-step recipes.

Recipe Executor

Executes stored recipes by replaying steps against live websites. On the happy path, steps execute as blind replay with parameter interpolation (no LLM involved). On failure, triggers self-healing: snapshots the current page, asks an LLM to suggest a corrected target, retries, and persists the fix. Includes 4-layer authentication support: cookie injection, stored credentials, login redirect detection, and expired cookie refresh.
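
The replay-then-heal behavior might look roughly like the sketch below. The step format (a dict of strings with {placeholder} fields), the function names, and the single-retry policy are assumptions made for illustration; execute and suggest_target stand in for the browser action and the LLM call, and authentication handling is omitted.

```python
from typing import Callable


def interpolate(step: dict, params: dict) -> dict:
    """Fill {placeholders} in every field of a step."""
    return {key: value.format(**params) for key, value in step.items()}


def run_recipe(
    steps: list,
    params: dict,
    execute: Callable[[dict], bool],
    suggest_target: Callable[[dict], str],
) -> list:
    """Replay steps; on failure, ask for a corrected target and retry once.

    Returns the (possibly healed) recipe so the caller can persist it.
    """
    healed = []
    for template in steps:
        step = interpolate(template, params)
        if execute(step):
            # Happy path: blind replay, no LLM involved.
            healed.append(template)
            continue
        # Self-healing path: only reached when blind replay fails.
        fixed_template = {**template, "target": suggest_target(step)}
        if not execute(interpolate(fixed_template, params)):
            raise RuntimeError(f"step failed after healing: {step}")
        healed.append(fixed_template)
    return healed
```

Returning the healed templates rather than the interpolated steps keeps the placeholders intact, so a persisted fix still works for the next caller's parameters.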

Tool Generator

Converts validated recipes into typed Python functions with domain-prefixed names (e.g., linkedin_search_jobs). Functions are dynamically compiled with explicit typed signatures so MCP clients (Claude Desktop, Cursor, etc.) can discover and call them with proper parameter names and types.
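
One way to compile a function with an explicit typed signature at runtime is to generate its source and exec it. The sketch below assumes that approach; make_tool and the run callback are hypothetical names, and the real generator may work differently.

```python
import keyword
from typing import Any, Callable


def make_tool(
    domain: str,
    recipe_name: str,
    params: dict,
    run: Callable[[str, dict], Any],
) -> Callable[..., Any]:
    """Compile a function named <domain>_<recipe_name> with typed parameters."""
    func_name = f"{domain}_{recipe_name}"
    for name in (func_name, *params):
        # Only plain identifiers are ever placed into the generated source.
        if not name.isidentifier() or keyword.iskeyword(name):
            raise ValueError(f"not a valid Python identifier: {name!r}")

    signature = ", ".join(f"{name}: {tp.__name__}" for name, tp in params.items())
    call_args = ", ".join(f"{name!r}: {name}" for name in params)
    source = (
        f"def {func_name}({signature}):\n"
        f"    return _run({recipe_name!r}, {{{call_args}}})\n"
    )

    # Make the recipe runner and the annotation types visible to the new function.
    namespace = {"_run": run}
    namespace.update({tp.__name__: tp for tp in params.values()})
    exec(source, namespace)
    return namespace[func_name]
```

For example, make_tool("linkedin", "search_jobs", {"query": str, "limit": int}, run) yields a function named linkedin_search_jobs whose annotations are real types, which is what lets a client introspect parameter names and types.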

Request Flow

Here's what happens when your agent sends a click action:

  1. Agent sends POST /v1/sessions/{id}/click with {"target": "the Sign In button"}
  2. API authenticates the request and looks up the session
  3. Session Manager retrieves the active browser context
  4. BabelWrap Proxy takes a snapshot of the current page
  5. Element Resolver maps "the Sign In button" to a specific DOM element (via cache or LLM)
  6. Playwright Engine clicks the resolved element
  7. BabelWrap Proxy takes a new snapshot of the page after the click
  8. API returns the snapshot to the agent with action metadata (duration, success status)
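
Steps 3 through 8 can be sketched as a single handler with each component injected as a callable. Everything here is illustrative: handle_click, the callable signatures, and the response field names are assumptions, and authentication (step 2) is assumed to have happened already.

```python
import time
from typing import Any, Callable


def handle_click(
    session_id: str,
    target: str,
    get_context: Callable[[str], Any],
    take_snapshot: Callable[[Any], dict],
    resolve_element: Callable[[str, dict], str],
    click: Callable[[Any, str], None],
) -> dict:
    """Run one click action and return the post-click snapshot with metadata."""
    started = time.perf_counter()

    context = get_context(session_id)              # step 3: Session Manager
    before = take_snapshot(context)                # step 4: snapshot before acting
    element_id = resolve_element(target, before)   # step 5: Element Resolver
    click(context, element_id)                     # step 6: Playwright Engine
    after = take_snapshot(context)                 # step 7: snapshot after acting

    duration_ms = round((time.perf_counter() - started) * 1000, 2)
    return {                                       # step 8: response to the agent
        "success": True,
        "duration_ms": duration_ms,
        "snapshot": after,
    }
```

Passing the components in as callables keeps the ordering visible and lets the flow be exercised with stubs instead of a live browser.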

Tech Stack

Layer             Technology
Language          Python 3.12+
API Framework     FastAPI (async)
Browser Engine    Playwright (Chromium)
LLM Client        Anthropic SDK (Claude Haiku)
MCP Server        FastMCP
Database          PostgreSQL (asyncpg + SQLAlchemy)
Cache / State     Redis
Billing           Stripe (usage-based metering)
Explorer Agent    agno + Claude Sonnet 4
Containerization  Docker

Self-hosting: BabelWrap ships as a single Docker image. You need PostgreSQL, Redis, and an Anthropic API key. See the deployment guide for details.