web2mcp

URL -> installable CLI + MCP stdio adapter for Codex and Claude Code

Give it a website URL. It scans the site, detects its type, discovers APIs (OpenAPI, GraphQL, REST), and generates a ready-to-install adapter package with a CLI and a local MCP stdio server.

web2mcp create https://novelbin.me/ --ready

No AI key required. The default pipeline is 100% deterministic.

What you get

Every generated adapter includes:

Universal browser tools (9) - always present: browser_open, browser_get_state, browser_click, browser_fill, browser_press, browser_back, browser_wait, browser_extract, browser_screenshot

Management tools (4): adapter_info, adapter_doctor, list_available_actions, auth_status

Site-map tools (4): list_pages, find_page, open_page, refresh_site_map

API tools (when detected): api_search, api_list, gql_<query> (GraphQL)

Site-specific actions (heuristic): search, list_links, extract_cards, extract_table, extract_article, go_next_page, get_page, get_by_<param>, scroll_load_more, etc.

Quick start

git clone <repo>
cd web2mcp
npm install
npm run build

# Check environment
node dist/index.js doctor

# Create an adapter (no AI needed)
node dist/index.js create https://example.com --ready

The --ready flag runs the full cycle: scan -> generate -> install deps -> build -> doctor -> mcp:check -> print install commands.

Requirements

Node.js 20+
npm
Chromium (auto-installed via Playwright: npx playwright install chromium)
ThaiLLM API key (optional - AI is off by default)

Architecture

URL
 |
 +-> Scanner (Playwright)
 |     - SPA auto-detection (React/Vue/Angular/Next/Nuxt/Svelte)
 |     - Adaptive wait (skeleton detection, element count growth)
 |     - XHR/fetch capture with dedicated bucket (no cap)
 |     - apiWaitMs: extra wait for API calls to fire after page load
 |     - Shadow DOM + same-origin iframe extraction
 |     - Multi-page crawl (--max-pages)
 |     - Response body inspection (JSON schema extraction)
 |
 +-> API Discovery (multi-strategy)
 |     - OpenAPI/Swagger spec probe (12 common paths)
 |     - GraphQL introspection
 |     - networkHints XHR/fetch analysis
 |     - Direct path probing (/api/v1, /api/v2, /graphql, ...) when hints are empty
 |     - URL pattern extraction (/posts/{slug}, /users/{id})
 |     - Pagination query param detection (?page=N, ?cursor=xxx)
 |
 +-> Compact scan (token-efficient, importance-scored)
 |
 +-> Site type detection (scoring system, 83%+ accuracy)
 |     - search / docs / blog / ecommerce / directory / tool / dashboard / unknown
 |     - Wiki/encyclopedia detection
 |     - API site detection (URL + title signals)
 |
 +-> Action template library (17 templates, deterministic, no AI)
 |
 +-> Package generator (29 files)
 |     - src/runtime/ (browser session, state, tools, permission, trace,
 |                     overlay detector, site-map, auth profile, SPA detector)
 |     - src/cli.ts, src/mcp.ts, src/runner.ts
 |     - adapter.json, .mcp.json.example, codex.config.toml.example
 |
 +-> Smoke test (npm install, build, doctor, mcp:check, live get_page_text)
 +-> Quality scoring (auto-disable actions below 0.6 threshold)
 +-> Compatibility report (Excellent/Good/Partial/Poor + API endpoints)
 +-> Ready build (--ready: full install + build + doctor + mcp:check)
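
The quality-scoring stage in the diagram above disables actions that score below 0.6. A minimal sketch of that rule follows, assuming each action carries a numeric quality score between 0 and 1. The `ScoredAction` shape and the function name are hypothetical and are not taken from the generated code.

```typescript
// Hypothetical action record; the field names are assumptions for illustration.
interface ScoredAction {
  name: string;
  quality: number; // assumed range 0..1
  enabled: boolean;
}

// Threshold taken from the architecture diagram above.
const QUALITY_THRESHOLD = 0.6;

// Returns copies of the actions, disabling any whose score is below the threshold.
// Actions that were already disabled stay disabled.
function applyQualityThreshold(
  actions: ScoredAction[],
  threshold: number = QUALITY_THRESHOLD,
): ScoredAction[] {
  return actions.map((action) => ({
    ...action,
    enabled: action.enabled && action.quality >= threshold,
  }));
}
```

Disabled actions would presumably remain visible through `npm run cli -- actions --all`, which lists both enabled and disabled actions.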

Configuration

Copy .env.example to .env (optional - only needed for AI features):

AI_BASE_URL=http://thaillm.or.th/api/v1
AI_API_KEY=your_key_here
AI_MODEL=typhoon

Or create web2mcp.config.json in your project root:

{
  "outputDir": "outputs",
  "scanner": {
    "maxPages": 2,
    "scroll": true,
    "waitMs": 2000,
    "apiWaitMs": 3000
  },
  "ai": { "enabled": false }
}

Config priority: CLI flags > project config > user config (~/.web2mcp/config.json) > defaults.
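
The priority order above can be sketched as a layered merge in which later layers win key by key. This is an illustration only: the `Web2McpConfig` shape is inferred from the example config, and `resolveConfig` is a hypothetical name, not the project's actual loader.

```typescript
// Shapes inferred from the web2mcp.config.json example; treat them as assumptions.
interface ScannerConfig {
  maxPages?: number;
  scroll?: boolean;
  waitMs?: number;
  apiWaitMs?: number;
}

interface Web2McpConfig {
  outputDir?: string;
  scanner?: ScannerConfig;
  ai?: { enabled?: boolean };
}

// Merge layers passed from lowest to highest priority.
// Nested sections are merged per key, so a layer that sets only
// scanner.maxPages does not wipe out scanner.waitMs from a lower layer.
function resolveConfig(...layers: Web2McpConfig[]): Web2McpConfig {
  const result: Web2McpConfig = {};
  for (const layer of layers) {
    if (layer.outputDir !== undefined) result.outputDir = layer.outputDir;
    if (layer.scanner) result.scanner = { ...result.scanner, ...layer.scanner };
    if (layer.ai) result.ai = { ...result.ai, ...layer.ai };
  }
  return result;
}

// Example layers. The default values mirror the `create` flags table.
const defaults: Web2McpConfig = {
  outputDir: "outputs",
  scanner: { maxPages: 1, scroll: false, waitMs: 1500, apiWaitMs: 2000 },
  ai: { enabled: false },
};
const userConfig: Web2McpConfig = { scanner: { waitMs: 2500 } };
const projectConfig: Web2McpConfig = { scanner: { maxPages: 2, apiWaitMs: 3000 } };
const cliFlags: Web2McpConfig = { scanner: { maxPages: 5 } };

// Order of arguments: defaults < user config < project config < CLI flags.
const effective = resolveConfig(defaults, userConfig, projectConfig, cliFlags);
```

With these layers, `effective.scanner.maxPages` comes from the CLI flag, `apiWaitMs` from the project config, `waitMs` from the user config, and `scroll` from the defaults.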

Parent CLI reference

| Command | Description |
|---------|-------------|
| create <url> | Full pipeline: scan -> plan -> generate -> smoke test |
| create <url> --ready | + install/build/doctor/mcp:check + print install commands |
| scan <url> | Scan only -> stdout JSON |
| compact <scan> | Shrink scan-result.json for AI |
| model <scan> | scan-result -> site-model.json |
| inspect-site <url> | Detect site type + which templates apply |
| doctor | Check Node, Playwright, AI config |
| doctor-adapter <name> | Run adapter's npm run doctor |
| install-codex <name> | Print/run codex mcp add (absolute path) |
| install-claude <name> | Print/run claude mcp add + Desktop config |
| install-both <name> | Both above |
| repair <name> | Patch broken locators from latest trace |
| upgrade <name> | Regenerate src/ from current templates (keep site-model) |
| test-all | Regression test all adapters in outputs/ |
| test-all --live | + run live get_page_text + api_search on each adapter |
| benchmark | Run create pipeline on benchmarks/sites.json |
| benchmark-tasks | Run real browser flows and measure success rates |
| benchmark-detection | Measure site type detection accuracy (target: 80%+) |
| metrics | Aggregate stats from all adapters |
| pack <name> | Export adapter as zip (excludes secrets/node_modules) |
| list | List adapters in outputs/ |
| clean --stale | Remove adapters from old templates |

`create` flags

| Flag | Default | Description |
|------|---------|-------------|
| --ready | false | Full install/build/doctor cycle |
| --no-ai | (default) | Heuristic planner only |
| --ai-actions | false | Allow AI to design extra actions |
| --spa | false | Force SPA mode (networkidle + longer waits) |
| --api-wait-ms <n> | 2000 | Extra wait for API calls to fire after page load |
| --max-pages <n> | 1 | Crawl N in-domain pages (max 10) |
| --scroll | false | Scroll to trigger lazy content |
| --scroll-times <n> | 3 | Viewport scrolls (cap 10) |
| --wait-until <state> | domcontentloaded | Navigation wait state |
| --wait-ms <n> | 1500 | Extra wait for SPA rendering |
| --skip-test | false | Skip smoke test |
| --headful | false | Visible browser |
| --timeout <ms> | 30000 | Navigation timeout |
| --json | false | Output JSON |

Generated adapter usage

cd outputs/<name>-adapter

# Build
npm run build

# Self-test (includes locator quality check)
npm run doctor

# List actions (enabled/disabled)
npm run cli -- actions
npm run cli -- actions --all

# Run site-specific actions
npm run cli -- run get_page_text --json
npm run cli -- run search --set query="hello" --json
npm run cli -- run api_search --set query="hello" --json   # API-first (no browser)
npm run cli -- run get_by_post_slug --set post_slug="my-post" --json

# Dry-run a write action (preview without executing)
npm run cli -- run submit_form --set name="test" --dry-run

# Universal browser tools
npm run cli -- browser open https://example.com/ --json
npm run cli -- browser state --json
npm run cli -- browser state --json --max-elements 20   # compact
npm run cli -- browser click <elementId>
npm run cli -- browser fill <elementId> "value" --confirm
npm run cli -- browser extract text --json
npm run cli -- browser extract links --json
npm run cli -- browser extract tables --json
npm run cli -- browser screenshot --out shot.png

# Site map
npm run cli -- pages refresh --max-pages 5
npm run cli -- pages list

# Auth (manual login)
npm run cli -- auth setup
npm run cli -- auth status
npm run cli -- auth clear

# Trace inspection
npm run cli -- trace list
npm run cli -- trace show last
npm run cli -- trace open last

# MCP check (real client test - spawns server + calls tools)
npm run mcp:check
npm run mcp:check -- --quick   # skip live browser_open

# Install scripts
npm run install:codex
npm run install:claude

Install with Codex

codex mcp add <name> -- node "/absolute/path/to/dist/mcp.js"
codex mcp list

Or edit ~/.codex/config.toml (template in codex.config.toml.example):

[mcp_servers.<name>]
command = "node"
args = ["/absolute/path/to/dist/mcp.js"]
startup_timeout_sec = 10
tool_timeout_sec = 60
enabled = true

Install with Claude Code

claude mcp add <name> --transport stdio -- node "/absolute/path/to/dist/mcp.js"
claude mcp list

Claude Desktop config in .mcp.json.example:

{
  "mcpServers": {
    "<name>": {
      "type": "stdio",
      "command": "node",
      "args": ["/absolute/path/to/dist/mcp.js"],
      "env": {}
    }
  }
}

API Discovery

web2mcp uses multiple strategies to discover APIs:

1. OpenAPI/Swagger: probes 12 common paths (/openapi.json, /swagger.json, /api-docs, etc.)

If found: generates actions from spec with confidence 0.95
Supports OpenAPI 3.x and Swagger 2.x

2. GraphQL: sends introspection query to detected /graphql endpoints

Generates gql_<query> actions using graphql_query step type (direct HTTP, no browser)

3. REST/JSON endpoints: analyzes XHR/fetch networkHints

Classifies as search/list/detail/pagination/graphql
Generates api_search / api_list actions using fetch_json step type

4. Direct path probing: when networkHints yield no API endpoints

Probes /api/v1, /api/v2, /api, /rest, /graphql, etc. via HEAD/GET
Detects JSON responses and classifies endpoints

5. URL patterns: detects /posts/{slug}, /users/{id} from link analysis

Generates get_by_<param>(id) actions

All API actions use direct HTTP (fetch_json / graphql_query step types) - no browser required, 10x faster.
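
The URL-pattern strategy (item 5 above) can be sketched as grouping link paths by their parent path and replacing the varying last segment with a placeholder. This is a simplified heuristic for illustration, not the detector web2mcp ships. The function name, the `minExamples` parameter, and the `{id}`-versus-`{slug}` rule are assumptions.

```typescript
// Takes pathnames already extracted from same-origin links, e.g. "/posts/hello-world".
// Returns templated patterns such as "/posts/{slug}" or "/users/{id}".
function extractUrlPatterns(paths: string[], minExamples: number = 2): string[] {
  // Parent path -> distinct final segments observed under it.
  const groups = new Map<string, Set<string>>();

  for (const path of paths) {
    const segments = path.split("/").filter((s) => s.length > 0);
    if (segments.length < 2) continue; // need at least /collection/item
    const last = segments[segments.length - 1] as string;
    const parent = "/" + segments.slice(0, -1).join("/");
    const seen = groups.get(parent) ?? new Set<string>();
    seen.add(last);
    groups.set(parent, seen);
  }

  const patterns: string[] = [];
  groups.forEach((values, parent) => {
    // A single example is not enough evidence of a parameterized route.
    if (values.size < minExamples) return;
    const allNumeric = Array.from(values).every((v) => /^\d+$/.test(v));
    patterns.push(`${parent}/${allNumeric ? "{id}" : "{slug}"}`);
  });
  return patterns.sort();
}
```

Each resulting pattern would then map to one get_by_<param> action, with the placeholder name supplying the parameter.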

Step types

| Step | Transport | Description |
|------|-----------|-------------|
| goto | Browser | Navigate to URL |
| fill | Browser | Fill input element |
| click | Browser | Click element |
| press | Browser | Press key |
| wait | Browser | Wait N ms |
| select | Browser | Native <select> option |
| select_option | Browser | Native + custom div dropdowns |
| extract_text | Browser | Extract visible text |
| extract_links | Browser | Extract all links |
| extract_list | Browser | Extract list/card items |
| screenshot | Browser | Capture PNG |
| fetch_json | HTTP | Direct fetch() - no browser, 10x faster |
| graphql_query | HTTP | Direct GraphQL - no browser |

Permission levels

| Level | Default | Requires |
|-------|---------|---------|
| read | allowed | nothing |
| navigate | allowed | nothing |
| write | blocked | confirm: true |
| auth | blocked | confirm: true |
| download | blocked | confirm: true |
| dangerous | blocked | confirm: true + allowRisky: true |

CAPTCHA -> CAPTCHA_DETECTED (never bypassed)
Rate limit -> RATE_LIMITED
Access blocked -> ACCESS_BLOCKED
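
The permission table above amounts to a small gate function. The sketch below is one way to express it. The `PermissionRequest` type and `isAllowed` name are hypothetical, and the generated runtime may structure this check differently.

```typescript
type PermissionLevel =
  | "read"
  | "navigate"
  | "write"
  | "auth"
  | "download"
  | "dangerous";

// Hypothetical request shape; confirm and allowRisky mirror the table's "Requires" column.
interface PermissionRequest {
  level: PermissionLevel;
  confirm?: boolean;
  allowRisky?: boolean;
}

// Returns true when the request satisfies the requirements for its level.
function isAllowed(req: PermissionRequest): boolean {
  switch (req.level) {
    case "read":
    case "navigate":
      return true; // allowed by default
    case "write":
    case "auth":
    case "download":
      return req.confirm === true; // blocked unless explicitly confirmed
    case "dangerous":
      return req.confirm === true && req.allowRisky === true; // needs both flags
  }
}
```

For example, `isAllowed({ level: "write" })` is false, while `isAllowed({ level: "write", confirm: true })` is true. This matches the `--confirm` flag shown for `browser fill` in the usage section.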

Resilience features

| Feature | Description |
|---------|-------------|
| SPA auto-detection | Detects React/Vue/Angular/Next/Nuxt/Svelte, adjusts wait strategy |
| Adaptive wait | If element count < 5, waits 2s more for SPA rendering |
| API wait | apiWaitMs extra wait for API calls to fire after page load |
| Locator self-healing | Re-captures page state and matches by kind+text if locators fail |
| Session crash recovery | Auto-restarts browser session on crash |
| Network retry | Navigation retried 3x with exponential backoff (skips ERR_BLOCKED) |
| Cookie/modal dismiss | Playwright click -> JS click -> Escape key fallback |
| Context isolation | isolate: true option for fresh browser context per action |
| Anti-bot detection | RATE_LIMITED + ACCESS_BLOCKED error codes |
| Auto-repair hint | LOCATOR_FAILED errors include web2mcp repair command |
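
The network-retry row can be illustrated with a generic retry helper. This sketch assumes "3x" means three attempts in total, a 500 ms base delay, and that blocked requests are recognized by the substring `ERR_BLOCKED` in the error message. None of these details are confirmed by the project, and `retryNavigation` is a hypothetical name.

```typescript
// Retries an async operation with exponential backoff (base, 2x base, 4x base, ...).
// Errors whose message contains "ERR_BLOCKED" are rethrown immediately,
// since a blocked request is unlikely to succeed on a retry.
async function retryNavigation<T>(
  attempt: () => Promise<T>,
  maxAttempts: number = 3,
  baseDelayMs: number = 500,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err;
      const message = err instanceof Error ? err.message : String(err);
      if (message.includes("ERR_BLOCKED")) throw err;
      // Sleep before the next attempt, but not after the final one.
      if (i < maxAttempts - 1) {
        const delayMs = baseDelayMs * 2 ** i;
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

A caller would wrap the navigation call, for example `retryNavigation(() => page.goto(url))` with a Playwright `page`, so that transient network failures are retried while blocked responses surface immediately.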

Site type detection

Accuracy: 83%+ on benchmark set (6 diverse sites).

| Type | Signals |
|------|---------|
| search | Search input + form |
| docs | Documentation keywords, wiki/encyclopedia, many headings |
| blog | Article/author/published signals, pagination |
| ecommerce | Price patterns, rating/review, product cards |
| directory | Many links, repeated cards, pagination |
| tool | API URL patterns, HTTP/REST/JSON in title, forms without search |
| dashboard | Login form |
| unknown | Insufficient signals |

Run web2mcp benchmark-detection to measure accuracy on your benchmark set.

Trace system

Every tool call writes to traces/<traceId>/:

trace.json - tool, input, steps, result, duration
screenshot.png - viewport on error
dom.html - outer HTML on error (200kb cap)

npm run cli -- trace list
npm run cli -- trace show last
web2mcp repair <adapter>   # patch locators from latest trace

Compatibility report

Every create run generates compatibility-report.json + .md:

| Level | Score |
|-------|-------|
| Excellent | 85-100 |
| Good | 70-84 |
| Partial | 50-69 |
| Poor | 0-49 |

Includes: reachability, content detection, locator quality, action count, API endpoints found.
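
The score bands above translate directly into a lookup. A minimal sketch, assuming the score is a number from 0 to 100. `compatLevel` is a hypothetical helper name, and how the score itself is computed is not shown here.

```typescript
type CompatLevel = "Excellent" | "Good" | "Partial" | "Poor";

// Maps a 0-100 compatibility score to its level, following the table above.
function compatLevel(score: number): CompatLevel {
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new RangeError(`score must be between 0 and 100, got ${score}`);
  }
  if (score >= 85) return "Excellent";
  if (score >= 70) return "Good";
  if (score >= 50) return "Partial";
  return "Poor";
}
```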

Limitations

No CAPTCHA bypass
No purchase/payment/delete automation
Authentication requires manual login via auth setup
Multi-page crawl capped at 10 pages
API actions only work for public (unauthenticated) endpoints
GraphQL mutations not generated (read-only queries only)
MCP streaming output not yet supported (requires SDK upgrade)
YAML OpenAPI specs not yet parsed (JSON only)

Development

npm test              # 205 unit tests (TAP reporter, ASCII-safe)
npm run test:e2e      # 12 e2e MCP client tests
npm run lint
npm run format

Integration tests (require built adapter):

node tests/integration/permission-gate.mjs
node tests/integration/sitemap-live.mjs
node tests/integration/repair-full.mjs
node tests/integration/self-healing.mjs

License

MIT
