MCP server by babybank25
web2mcp
URL -> installable CLI + MCP stdio adapter for Codex and Claude Code
Give it a website URL. It scans the site, detects its type, discovers APIs (OpenAPI, GraphQL, REST), and generates a ready-to-install adapter package with a CLI and a local MCP stdio server.
web2mcp create https://novelbin.me/ --ready
No AI key required. The default pipeline is 100% deterministic.
What you get
Every generated adapter includes:
Universal browser tools (9) - always present:
browser_open, browser_get_state, browser_click, browser_fill, browser_press,
browser_back, browser_wait, browser_extract, browser_screenshot
Management tools (4): adapter_info, adapter_doctor, list_available_actions, auth_status
Site-map tools (4): list_pages, find_page, open_page, refresh_site_map
API tools (when detected): api_search, api_list, gql_<query> (GraphQL)
Site-specific actions (heuristic): search, list_links, extract_cards, extract_table,
extract_article, go_next_page, get_page, get_by_<param>, scroll_load_more, etc.
Quick start
git clone <repo>
cd web2mcp
npm install
npm run build
# Check environment
node dist/index.js doctor
# Create an adapter (no AI needed)
node dist/index.js create https://example.com --ready
The --ready flag runs the full cycle: scan -> generate -> install deps -> build -> doctor -> mcp:check -> print install commands.
Requirements
- Node.js 20+
- npm
- Chromium (auto-installed via Playwright: npx playwright install chromium)
- ThaiLLM API key (optional - AI is off by default)
Architecture
URL
|
+-> Scanner (Playwright)
| - SPA auto-detection (React/Vue/Angular/Next/Nuxt/Svelte)
| - Adaptive wait (skeleton detection, element count growth)
| - XHR/fetch capture with dedicated bucket (no cap)
| - apiWaitMs: extra wait for API calls to fire after page load
| - Shadow DOM + same-origin iframe extraction
| - Multi-page crawl (--max-pages)
| - Response body inspection (JSON schema extraction)
|
+-> API Discovery (multi-strategy)
| - OpenAPI/Swagger spec probe (12 common paths)
| - GraphQL introspection
| - networkHints XHR/fetch analysis
| - Direct path probing (/api/v1, /api/v2, /graphql, ...) when hints are empty
| - URL pattern extraction (/posts/{slug}, /users/{id})
| - Pagination query param detection (?page=N, ?cursor=xxx)
|
+-> Compact scan (token-efficient, importance-scored)
|
+-> Site type detection (scoring system, 83%+ accuracy)
| - search / docs / blog / ecommerce / directory / tool / dashboard / unknown
| - Wiki/encyclopedia detection
| - API site detection (URL + title signals)
|
+-> Action template library (17 templates, deterministic, no AI)
|
+-> Package generator (29 files)
| - src/runtime/ (browser session, state, tools, permission, trace,
| overlay detector, site-map, auth profile, SPA detector)
| - src/cli.ts, src/mcp.ts, src/runner.ts
| - adapter.json, .mcp.json.example, codex.config.toml.example
|
+-> Smoke test (npm install, build, doctor, mcp:check, live get_page_text)
+-> Quality scoring (auto-disable actions below 0.6 threshold)
+-> Compatibility report (Excellent/Good/Partial/Poor + API endpoints)
+-> Ready build (--ready: full install + build + doctor + mcp:check)
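The quality-scoring stage in the pipeline above can be pictured as a simple gate. The sketch below is illustrative only: the `ScoredAction` shape, the `quality` field, and the `applyQualityGate` name are assumptions, not the generator's real types. Only the 0.6 threshold comes from this README.

```typescript
// Hypothetical action record; the real adapter.json schema may differ.
interface ScoredAction {
  name: string;
  quality: number; // assumed to be a score in [0, 1]
  enabled: boolean;
}

// Threshold taken from the pipeline description ("below 0.6" is disabled).
const QUALITY_THRESHOLD = 0.6;

// Returns a copy of the actions with `enabled` set from the quality score.
function applyQualityGate(
  actions: ScoredAction[],
  threshold: number = QUALITY_THRESHOLD,
): ScoredAction[] {
  return actions.map((a) => ({ ...a, enabled: a.quality >= threshold }));
}

const gated = applyQualityGate([
  { name: "search", quality: 0.9, enabled: true },
  { name: "extract_table", quality: 0.4, enabled: true },
]);
```

Under these assumptions, `search` stays enabled and `extract_table` is disabled but still present in the list, which matches the enabled/disabled view of `npm run cli -- actions --all`.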
Configuration
Copy .env.example to .env (optional - only needed for AI features):
AI_BASE_URL=http://thaillm.or.th/api/v1
AI_API_KEY=your_key_here
AI_MODEL=typhoon
Or create web2mcp.config.json in your project root:
{
"outputDir": "outputs",
"scanner": {
"maxPages": 2,
"scroll": true,
"waitMs": 2000,
"apiWaitMs": 3000
},
"ai": { "enabled": false }
}
Config priority: CLI flags > project config > user config (~/.web2mcp/config.json) > defaults.
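As a rough illustration of that priority order, the layers can be merged from lowest to highest priority so that later layers win key by key. This is a minimal sketch, not the project's actual loader: `Web2McpConfig`, `ScannerConfig`, and `resolveConfig` are hypothetical names, and the default values are read off the create flags table.

```typescript
// Hypothetical config shape, mirroring the keys in the example config file.
interface ScannerConfig {
  maxPages?: number;
  scroll?: boolean;
  waitMs?: number;
  apiWaitMs?: number;
}

interface Web2McpConfig {
  outputDir?: string;
  scanner?: ScannerConfig;
  ai?: { enabled?: boolean };
}

// Merge layers given from lowest to highest priority; later layers win.
// Layers should omit keys they do not set rather than pass undefined.
function resolveConfig(...layersLowToHigh: Web2McpConfig[]): Web2McpConfig {
  const result: Web2McpConfig = {};
  for (const layer of layersLowToHigh) {
    if (layer.outputDir !== undefined) result.outputDir = layer.outputDir;
    if (layer.scanner) result.scanner = { ...result.scanner, ...layer.scanner };
    if (layer.ai) result.ai = { ...result.ai, ...layer.ai };
  }
  return result;
}

const defaults: Web2McpConfig = {
  outputDir: "outputs",
  scanner: { maxPages: 1, scroll: false, waitMs: 1500, apiWaitMs: 2000 },
  ai: { enabled: false },
};
const userConfig: Web2McpConfig = { scanner: { waitMs: 2500 } };
const projectConfig: Web2McpConfig = { scanner: { maxPages: 2, apiWaitMs: 3000 } };
const cliFlags: Web2McpConfig = { scanner: { maxPages: 5 } };

// defaults < user config < project config < CLI flags
const effective = resolveConfig(defaults, userConfig, projectConfig, cliFlags);
```

Here `maxPages` ends up as 5 (CLI flag), `apiWaitMs` as 3000 (project config), `waitMs` as 2500 (user config), and `scroll` keeps its default.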
Parent CLI reference
| Command | Description |
|---------|-------------|
| create <url> | Full pipeline: scan -> plan -> generate -> smoke test |
| create <url> --ready | + install/build/doctor/mcp:check + print install commands |
| scan <url> | Scan only -> stdout JSON |
| compact <scan> | Shrink scan-result.json for AI |
| model <scan> | scan-result -> site-model.json |
| inspect-site <url> | Detect site type + which templates apply |
| doctor | Check Node, Playwright, AI config |
| doctor-adapter <name> | Run adapter's npm run doctor |
| install-codex <name> | Print/run codex mcp add (absolute path) |
| install-claude <name> | Print/run claude mcp add + Desktop config |
| install-both <name> | Both above |
| repair <name> | Patch broken locators from latest trace |
| upgrade <name> | Regenerate src/ from current templates (keep site-model) |
| test-all | Regression test all adapters in outputs/ |
| test-all --live | + run live get_page_text + api_search on each adapter |
| benchmark | Run create pipeline on benchmarks/sites.json |
| benchmark-tasks | Run real browser flows and measure success rates |
| benchmark-detection | Measure site type detection accuracy (target: 80%+) |
| metrics | Aggregate stats from all adapters |
| pack <name> | Export adapter as zip (excludes secrets/node_modules) |
| list | List adapters in outputs/ |
| clean --stale | Remove adapters from old templates |
create flags
| Flag | Default | Description |
|------|---------|-------------|
| --ready | false | Full install/build/doctor cycle |
| --no-ai | (default) | Heuristic planner only |
| --ai-actions | false | Allow AI to design extra actions |
| --spa | false | Force SPA mode (networkidle + longer waits) |
| --api-wait-ms <n> | 2000 | Extra wait for API calls to fire after page load |
| --max-pages <n> | 1 | Crawl N in-domain pages (max 10) |
| --scroll | false | Scroll to trigger lazy content |
| --scroll-times <n> | 3 | Viewport scrolls (cap 10) |
| --wait-until <state> | domcontentloaded | Navigation wait state |
| --wait-ms <n> | 1500 | Extra wait for SPA rendering |
| --skip-test | false | Skip smoke test |
| --headful | false | Visible browser |
| --timeout <ms> | 30000 | Navigation timeout |
| --json | false | Output JSON |
Generated adapter usage
cd outputs/<name>-adapter
# Build
npm run build
# Self-test (includes locator quality check)
npm run doctor
# List actions (enabled/disabled)
npm run cli -- actions
npm run cli -- actions --all
# Run site-specific actions
npm run cli -- run get_page_text --json
npm run cli -- run search --set query="hello" --json
npm run cli -- run api_search --set query="hello" --json # API-first (no browser)
npm run cli -- run get_by_post_slug --set post_slug="my-post" --json
# Dry-run a write action (preview without executing)
npm run cli -- run submit_form --set name="test" --dry-run
# Universal browser tools
npm run cli -- browser open https://example.com/ --json
npm run cli -- browser state --json
npm run cli -- browser state --json --max-elements 20 # compact
npm run cli -- browser click <elementId>
npm run cli -- browser fill <elementId> "value" --confirm
npm run cli -- browser extract text --json
npm run cli -- browser extract links --json
npm run cli -- browser extract tables --json
npm run cli -- browser screenshot --out shot.png
# Site map
npm run cli -- pages refresh --max-pages 5
npm run cli -- pages list
# Auth (manual login)
npm run cli -- auth setup
npm run cli -- auth status
npm run cli -- auth clear
# Trace inspection
npm run cli -- trace list
npm run cli -- trace show last
npm run cli -- trace open last
# MCP check (real client test - spawns server + calls tools)
npm run mcp:check
npm run mcp:check -- --quick # skip live browser_open
# Install scripts
npm run install:codex
npm run install:claude
Install with Codex
codex mcp add <name> -- node "/absolute/path/to/dist/mcp.js"
codex mcp list
Or edit ~/.codex/config.toml (template in codex.config.toml.example):
[mcp_servers.<name>]
command = "node"
args = ["/absolute/path/to/dist/mcp.js"]
startup_timeout_sec = 10
tool_timeout_sec = 60
enabled = true
Install with Claude Code
claude mcp add <name> --transport stdio -- node "/absolute/path/to/dist/mcp.js"
claude mcp list
Claude Desktop config in .mcp.json.example:
{
"mcpServers": {
"<name>": {
"type": "stdio",
"command": "node",
"args": ["/absolute/path/to/dist/mcp.js"],
"env": {}
}
}
}
API Discovery
web2mcp uses multiple strategies to discover APIs:
1. OpenAPI/Swagger: probes 12 common paths (/openapi.json, /swagger.json, /api-docs, etc.)
- If found: generates actions from spec with confidence 0.95
- Supports OpenAPI 3.x and Swagger 2.x
2. GraphQL: sends introspection query to detected /graphql endpoints
- Generates gql_<query> actions using graphql_query step type (direct HTTP, no browser)
3. REST/JSON endpoints: analyzes XHR/fetch networkHints
- Classifies as search/list/detail/pagination/graphql
- Generates api_search / api_list actions using fetch_json step type
4. Direct path probing: when networkHints yield no API endpoints
- Probes /api/v1, /api/v2, /api, /rest, /graphql, etc. via HEAD/GET
- Detects JSON responses and classifies endpoints
5. URL patterns: detects /posts/{slug}, /users/{id} from link analysis
- Generates get_by_<param>(id) actions
All API actions use direct HTTP (fetch_json / graphql_query step types) - no browser required, 10x faster.
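To make the URL-pattern strategy (item 5) concrete, here is one way such link analysis could work. It is a sketch under stated assumptions, not web2mcp's actual algorithm: `extractUrlPatterns` and its `minSamples` parameter are hypothetical, and the rule "all-numeric values become {id}, anything else becomes {slug}" is a simplification chosen for the example.

```typescript
// Groups paths by their parent prefix and replaces a varying last segment
// with a placeholder: {id} when every observed value is numeric, else {slug}.
function extractUrlPatterns(paths: string[], minSamples: number = 2): string[] {
  const groups = new Map<string, Set<string>>();

  for (const path of paths) {
    const [pathname = ""] = path.split("?"); // drop any query string
    const segments = pathname.split("/").filter((s) => s.length > 0);
    const last = segments.pop();
    // Skip the root and single-segment paths such as /about.
    if (last === undefined || segments.length === 0) continue;

    const prefix = "/" + segments.join("/");
    let values = groups.get(prefix);
    if (!values) {
      values = new Set<string>();
      groups.set(prefix, values);
    }
    values.add(last);
  }

  const patterns: string[] = [];
  groups.forEach((values, prefix) => {
    // Require several distinct values before calling the segment a parameter.
    if (values.size < minSamples) return;
    const allNumeric = Array.from(values).every((v) => /^\d+$/.test(v));
    patterns.push(`${prefix}/${allNumeric ? "{id}" : "{slug}"}`);
  });
  return patterns.sort();
}

const patterns = extractUrlPatterns([
  "/posts/hello-world",
  "/posts/second-post",
  "/users/1",
  "/users/42",
  "/about",
]);
// patterns: ["/posts/{slug}", "/users/{id}"]
```

Requiring at least two distinct values per prefix keeps one-off pages like /about from being mistaken for parameterized routes.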
Step types
| Step | Transport | Description |
|------|-----------|-------------|
| goto | Browser | Navigate to URL |
| fill | Browser | Fill input element |
| click | Browser | Click element |
| press | Browser | Press key |
| wait | Browser | Wait N ms |
| select | Browser | Native <select> option |
| select_option | Browser | Native + custom div dropdowns |
| extract_text | Browser | Extract visible text |
| extract_links | Browser | Extract all links |
| extract_list | Browser | Extract list/card items |
| screenshot | Browser | Capture PNG |
| fetch_json | HTTP | Direct fetch() - no browser, 10x faster |
| graphql_query | HTTP | Direct GraphQL - no browser |
Permission levels
| Level | Default | Requires |
|-------|---------|---------|
| read | allowed | nothing |
| navigate | allowed | nothing |
| write | blocked | confirm: true |
| auth | blocked | confirm: true |
| download | blocked | confirm: true |
| dangerous | blocked | confirm: true + allowRisky: true |
CAPTCHA -> CAPTCHA_DETECTED (never bypassed)
Rate limit -> RATE_LIMITED
Access blocked -> ACCESS_BLOCKED
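The permission table can be read as a small decision function. The sketch below encodes only the rules listed in the table; the `PermissionLevel` and `CallOptions` types and the `isAllowed` name are assumptions made for illustration, not the adapter runtime's real API.

```typescript
type PermissionLevel =
  | "read"
  | "navigate"
  | "write"
  | "auth"
  | "download"
  | "dangerous";

// Hypothetical per-call options, named after the table's "Requires" column.
interface CallOptions {
  confirm?: boolean;
  allowRisky?: boolean;
}

// Returns true when a call at the given level may proceed.
function isAllowed(level: PermissionLevel, opts: CallOptions = {}): boolean {
  switch (level) {
    case "read":
    case "navigate":
      return true; // allowed by default
    case "write":
    case "auth":
    case "download":
      return opts.confirm === true; // blocked unless explicitly confirmed
    case "dangerous":
      return opts.confirm === true && opts.allowRisky === true;
  }
}
```

For example, `isAllowed("write")` is false, `isAllowed("write", { confirm: true })` is true, and `isAllowed("dangerous", { confirm: true })` is still false because `allowRisky` is missing.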
Resilience features
| Feature | Description |
|---------|-------------|
| SPA auto-detection | Detects React/Vue/Angular/Next/Nuxt/Svelte, adjusts wait strategy |
| Adaptive wait | If element count < 5, waits 2s more for SPA rendering |
| API wait | apiWaitMs extra wait for API calls to fire after page load |
| Locator self-healing | Re-captures page state and matches by kind+text if locators fail |
| Session crash recovery | Auto-restarts browser session on crash |
| Network retry | Navigation retried 3x with exponential backoff (skips ERR_BLOCKED) |
| Cookie/modal dismiss | Playwright click -> JS click -> Escape key fallback |
| Context isolation | isolate: true option for fresh browser context per action |
| Anti-bot detection | RATE_LIMITED + ACCESS_BLOCKED error codes |
| Auto-repair hint | LOCATOR_FAILED errors include web2mcp repair command |
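The network-retry row can be sketched as follows. Several details are assumptions rather than documented behavior: the 500 ms base delay, the reading of "3x" as three retries after the first attempt, and matching blocked navigations by the substring ERR_BLOCKED in the error message. `backoffDelayMs`, `isRetryable`, and `withRetry` are hypothetical names.

```typescript
// Delay before retry number `attempt` (1-based): base, 2x base, 4x base, ...
function backoffDelayMs(attempt: number, baseMs: number = 500): number {
  return baseMs * 2 ** (attempt - 1);
}

// Blocked navigations are not retried; everything else is.
function isRetryable(message: string): boolean {
  return !message.includes("ERR_BLOCKED");
}

// Runs `task`, retrying failed attempts with exponential backoff.
// `sleep` is injectable so the waiting can be replaced in tests.
async function withRetry<T>(
  task: () => Promise<T>,
  maxRetries: number = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      // Give up when retries are exhausted or the error is a block.
      if (attempt > maxRetries || !isRetryable(message)) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}

const schedule = [1, 2, 3].map((n) => backoffDelayMs(n)); // [500, 1000, 2000]
```

A caller would wrap the navigation itself, for example `withRetry(() => page.goto(url))` with a Playwright `page`.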
Site type detection
Accuracy: 83%+ on benchmark set (6 diverse sites).
| Type | Signals |
|------|---------|
| search | Search input + form |
| docs | Documentation keywords, wiki/encyclopedia, many headings |
| blog | Article/author/published signals, pagination |
| ecommerce | Price patterns, rating/review, product cards |
| directory | Many links, repeated cards, pagination |
| tool | API URL patterns, HTTP/REST/JSON in title, forms without search |
| dashboard | Login form |
| unknown | Insufficient signals |
Run web2mcp benchmark-detection to measure accuracy on your benchmark set.
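A scoring-based detector along these lines might look like the sketch below. It covers only a subset of the signals in the table, and every weight, threshold, and field name (`PageSignals`, `detectSiteType`, `minScore`) is a hypothetical choice for illustration; the real scoring system is not documented here.

```typescript
type SiteType =
  | "search" | "docs" | "blog" | "ecommerce"
  | "directory" | "tool" | "dashboard" | "unknown";

// Hypothetical signals a scan might expose.
interface PageSignals {
  hasSearchInput: boolean;
  hasLoginForm: boolean;
  hasAuthorOrPublishedMeta: boolean;
  pricePatternCount: number;
  headingCount: number;
  repeatedCardCount: number;
}

// Scores each candidate type and returns the best one, or "unknown"
// when no candidate reaches `minScore` (insufficient signals).
function detectSiteType(s: PageSignals, minScore: number = 2): SiteType {
  const scores: Array<[SiteType, number]> = [
    ["search", s.hasSearchInput ? 3 : 0],
    ["dashboard", s.hasLoginForm ? 3 : 0],
    ["blog", s.hasAuthorOrPublishedMeta ? 3 : 0],
    ["ecommerce", Math.min(s.pricePatternCount, 5)],
    ["docs", s.headingCount >= 10 ? 3 : 0],
    ["directory", s.repeatedCardCount >= 10 ? 2 : 0],
  ];

  let best: [SiteType, number] = ["unknown", 0];
  for (const entry of scores) {
    if (entry[1] > best[1]) best = entry; // ties keep the earlier entry
  }
  return best[1] >= minScore ? best[0] : "unknown";
}

const shop = detectSiteType({
  hasSearchInput: true,
  hasLoginForm: false,
  hasAuthorOrPublishedMeta: false,
  pricePatternCount: 8,
  headingCount: 4,
  repeatedCardCount: 12,
});
```

With these made-up weights, a page with a search box, many price patterns, and repeated cards scores highest as ecommerce, and a page with no signals falls through to unknown.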
Trace system
Every tool call writes to traces/<traceId>/:
- trace.json - tool, input, steps, result, duration
- screenshot.png - viewport on error
- dom.html - outer HTML on error (200kb cap)
npm run cli -- trace list
npm run cli -- trace show last
web2mcp repair <adapter> # patch locators from latest trace
Compatibility report
Every create run generates compatibility-report.json + .md:
| Level | Score |
|-------|-------|
| Excellent | 85-100 |
| Good | 70-84 |
| Partial | 50-69 |
| Poor | 0-49 |
Includes: reachability, content detection, locator quality, action count, API endpoints found.
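The score bands translate directly into a lookup. A minimal sketch, assuming the score is a number from 0 to 100; `CompatLevel` and `compatLevel` are illustrative names, not part of the generated report's API.

```typescript
type CompatLevel = "Excellent" | "Good" | "Partial" | "Poor";

// Maps a 0-100 compatibility score to its level using the bands above.
function compatLevel(score: number): CompatLevel {
  if (score >= 85) return "Excellent";
  if (score >= 70) return "Good";
  if (score >= 50) return "Partial";
  return "Poor";
}
```

So a score of 84 is reported as Good, and 49 as Poor.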
Limitations
- No CAPTCHA bypass
- No purchase/payment/delete automation
- Authentication requires manual login via auth setup
- Multi-page crawl capped at 10 pages
- API actions only work for public (unauthenticated) endpoints
- GraphQL mutations not generated (read-only queries only)
- MCP streaming output not yet supported (requires SDK upgrade)
- YAML OpenAPI specs not yet parsed (JSON only)
Development
npm test # 205 unit tests (TAP reporter, ASCII-safe)
npm run test:e2e # 12 e2e MCP client tests
npm run lint
npm run format
Integration tests (require built adapter):
node tests/integration/permission-gate.mjs
node tests/integration/sitemap-live.mjs
node tests/integration/repair-full.mjs
node tests/integration/self-healing.mjs
License
MIT