# 🔥 DocBreach (MCP)

> The web is hostile to AI Agents. We brought a crowbar.

Zero-cost, zero-SaaS documentation discovery for Claude, Cursor, and Windsurf.

Quickstart · Architecture · Why This Exists
## 🚨 The Problem: Agentic Workflows Are Blind
We're in the era of autonomous AI Agents, but the web was built to repel bots, not serve them.
When your Claude, Cursor, or Windsurf tries to read an obscure API's documentation, it gets annihilated by:
- Cloudflare WAFs throwing 403 CAPTCHAs at a plain Node.js `fetch()`.
- Empty SPA shells (Next.js, Mintlify, GitBook) that render nothing without a heavyweight headless browser.
- Legacy enterprise PDFs that crash the model's context window.
- Login walls that lock public API references behind OAuth gates.
- "AI-friendly" SaaS tools (Firecrawl, Jina, Context7) charging you $50/mo to read pages that are already public.
The LLM doesn't need a middleman. It needs raw signal.
## ⚔️ The Weapon: Guerrilla Architecture
DocBreach is a ruthless, 100% local MCP server. It doesn't ask for permission. It uses military-grade heuristics to extract clean, LLM-optimized Markdown from any developer portal β and it does it for free, forever.
| Enemy Defense | DocBreach Tactical Override |
| :--- | :--- |
| 🛡️ Cloudflare / WAF 403 | Temporal Proxying: hits a WAF? Silently pivots to the Wayback Machine. The docs from last week work just fine. |
| ⚛️ JavaScript SPA Walls | Hydration Hijacking: rips `__NEXT_DATA__`, `__NUXT__`, `__GITBOOK_STATE__` straight from the DOM. Zero JS engine needed. |
| 🪟 Hidden iFrames | Source Chasing: detects embedded Swagger/Postman/Stoplight apps, destroys the wrapper, resolves the true origin URL. |
| 📄 Legacy PDF Manuals | Native Brute-Force: in-memory PDF parsing. Your AI reads 2004 banking manuals like they're GitHub READMEs. |
| 🔒 Login Walls | Wall Detection: identifies OAuth/SSO gates instantly and tells the agent to pivot to public alternatives. |
| 🕳️ Ghost Town Sites | Self-Healing Errors: no docs found? DocBreach guides the agent to search GitHub repos, SDK source code, or llms.txt files. |
| 💸 SaaS Scraping Taxes | Zero. Forever. Everything runs locally via Cheerio and Turndown. No API keys. No accounts. No telemetry. |
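The Temporal Proxying tactic can be sketched as a thin fetch wrapper over the Internet Archive's public availability API (`archive.org/wayback/available`). The function names below are illustrative, not DocBreach's actual internals:

```typescript
// Sketch of "temporal proxying": if the live fetch is blocked by a WAF,
// ask the Wayback Machine for its closest snapshot and read that instead.

interface WaybackResponse {
  archived_snapshots?: { closest?: { available?: boolean; url?: string } };
}

// Pure helper: decide whether a live response warrants the Wayback pivot.
function shouldPivot(status: number): boolean {
  return status === 403 || status === 503; // WAF block or rate-limit wall
}

// Pure helper: pull the snapshot URL out of the availability API payload.
function snapshotUrl(res: WaybackResponse): string | null {
  const closest = res.archived_snapshots?.closest;
  return closest?.available && closest.url ? closest.url : null;
}

async function fetchWithTemporalProxy(url: string): Promise<string> {
  const live = await fetch(url);
  if (!shouldPivot(live.status)) return live.text();
  // Blocked: query the availability API for the most recent archived copy.
  const avail = await fetch(
    "https://archive.org/wayback/available?url=" + encodeURIComponent(url)
  );
  const archived = snapshotUrl((await avail.json()) as WaybackResponse);
  if (!archived) throw new Error(`Blocked (${live.status}) with no snapshot`);
  return (await fetch(archived)).text();
}
```

Keeping `shouldPivot` and `snapshotUrl` pure makes the fallback decision testable without touching the network.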
"The LLM shouldn't be smart at scraping. It should be smart at coding. DocBreach handles the dirty work."
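The Wall Detection tactic can be sketched as a simple heuristic over the final URL and the returned markup. The patterns below are illustrative assumptions, not DocBreach's real rule set:

```typescript
// Flag a page as auth-gated when the URL or markup carries OAuth/SSO signals,
// so the agent can pivot to public alternatives instead of burning context
// on a login form.

const AUTH_URL_PATTERNS = [/\/oauth2?\//i, /\/sso\//i, /\/(log|sign)in\b/i];
const AUTH_DOM_PATTERNS = [
  /<input[^>]+type=["']password["']/i, // a literal password field
  /sign in to continue/i,              // common SSO interstitial copy
];

function looksLikeLoginWall(finalUrl: string, html: string): boolean {
  return (
    AUTH_URL_PATTERNS.some((p) => p.test(finalUrl)) ||
    AUTH_DOM_PATTERNS.some((p) => p.test(html))
  );
}
```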
## 🚀 Quickstart
### Claude Desktop

Add to `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "docbreach": {
      "command": "npx",
      "args": ["-y", "doc-breach-mcp"]
    }
  }
}
```
### Cursor / Windsurf

Add to your MCP settings (both use the same `mcpServers` key):

```json
{
  "mcpServers": {
    "doc-breach": {
      "command": "npx",
      "args": ["-y", "doc-breach-mcp"]
    }
  }
}
```
That's it. No API keys. No .env files. No sign-ups. It just works.
## 🧠 How It Thinks
DocBreach gives your AI agent 5 precision tools and lets the model drive:
```
You: "Integrate with the Datadog API and list all monitors"

Agent → docs.discover({ query: "datadog API" })
      → Found: docs.datadoghq.com/api/latest/ (openapi)

Agent → docs.map({ domain: "docs.datadoghq.com" })
      → 🗺️ Sitemap hierarchy, auto-generated Mermaid graph, and llms.txt discovery

Agent → docs.read({ url: "https://docs.datadoghq.com/api/latest/" })
      → 📄 Clean Markdown + nav links + auth requirements

Agent → docs.extract({ url: "https://api.datadoghq.com/api/v2/openapi.yaml", tag: "monitors" })
      → 📋 GET  /api/v1/monitor       → List all monitors
           GET  /api/v1/monitor/{id}  → Get a monitor's details
           POST /api/v1/monitor       → Create a monitor
           ...

Agent: "I see the API requires DD-API-KEY and DD-APPLICATION-KEY headers,
        and you need to select a DD_SITE (US1, EU, US3, US5, AP1)..."
```
The model reasons. DocBreach retrieves. Nobody hallucinates.
### The 11-Step Reader Pipeline
Every URL passes through a battle-hardened, 11-step extraction pipeline:
```
URL
 │
 ├─ 1.  Preflight ────────  HEAD check → Content-Type, size, reject >10MB
 ├─ 2.  Fetch ────────────  GET + Wayback Machine fallback on 403/503
 ├─ 3.  Login Detection ──  OAuth/SSO wall? → abort + guide agent
 ├─ 4.  Format Detection ─  OpenAPI? Postman? PDF? llms.txt? Markdown?
 ├─ 5.  Binary Handling ──  PDF <5MB → in-memory parse
 ├─ 6.  Spec Summary ─────  OpenAPI/Postman → structured Markdown
 ├─ 7.  SPA Hydration ────  __NEXT_DATA__, __NUXT__, readme-data, GitBook
 ├─ 8.  Nav Extraction ───  Sidebar links → absolute URLs
 ├─ 9.  iFrame Intel ─────  Swagger/Postman/Stoplight embed → true URL
 ├─ 10. HTML Cleaning ────  Cheerio → remove headers, footers, ads, nav
 ├─ 11. Markdown ─────────  Turndown + boundary-aware truncation
 │
 ▼
Clean, LLM-ready Markdown
```
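Step 7 (SPA hydration) needs no JS engine at all; the state is already sitting in the HTML. A real implementation would use a DOM parser (the pipeline uses Cheerio), but a regex keeps this sketch dependency-free; `extractNextData` is an illustrative name:

```typescript
// Pull the embedded state JSON out of a Next.js page without running any
// JavaScript: the framework serializes it into a <script id="__NEXT_DATA__">
// tag at build/render time.

function extractNextData(html: string): unknown | null {
  const m = html.match(
    /<script[^>]+id=["']__NEXT_DATA__["'][^>]*>([\s\S]*?)<\/script>/i
  );
  if (!m) return null;
  try {
    return JSON.parse(m[1]);
  } catch {
    return null; // malformed or truncated payload
  }
}
```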
## 🪓 The Uncomfortable Truth
The developer tooling market has a parasite problem.
Companies like Firecrawl, Jina Reader, and Context7 take public documentation (pages freely accessible to any browser), wrap it in a proprietary API, and charge you a monthly subscription to access what was already yours.
They aren't adding value. They're adding a toll booth to the public internet.
DocBreach exists because:
- Documentation is public. If a human can read it, an agent should too.
- Scraping is a solved problem. Cheerio + Turndown have existed for a decade. You don't need a $20M startup to parse HTML.
- Your AI runs locally. Why should it phone home to a SaaS to read a README?
This is not a product. This is a crowbar.
## 📊 DocBreach vs. The Toll Booths
| | DocBreach | Firecrawl | Jina Reader | Context7 |
|---|:---:|:---:|:---:|:---:|
| Cost | $0 | $50+/mo | $30+/mo | Free (limited) |
| Runs locally | ✅ | ❌ Cloud | ❌ Cloud | ❌ Cloud |
| No API keys | ✅ | ❌ | ❌ | ❌ |
| No telemetry | ✅ | ❌ | ❌ | ❌ |
| WAF bypass | ✅ Wayback | ✅ Paid proxy | ❌ | ❌ |
| SPA extraction | ✅ Hydration | ✅ Headless | ✅ | ❌ |
| PDF parsing | ✅ Native | ✅ | ✅ | ❌ |
| OpenAPI extraction | ✅ | ❌ | ❌ | ❌ |
| HATEOAS navigation | ✅ | ❌ | ❌ | ❌ |
| Cognitive rules | ✅ | ❌ | ❌ | ❌ |
| Open source | ✅ MIT | Partial | ✅ | ✅ |
## 🔧 Tools Reference
### docs.discover

Find documentation sources for any service, library, or API.

```typescript
docs.discover({ query: "stripe webhooks API" })
// → [ { url, title, type: "openapi", source: "probe" }, ... ]
```

### docs.map

Map the complete documentation structure of any domain. Extracts sitemaps, robots.txt, and llms.txt, returning an architectural blueprint.

```typescript
docs.map({ domain: "docs.stripe.com" })
// → { total: 1200, sections: { "Root": [...], "API": [...] }, ... }
```

### docs.read

Read any documentation URL and return clean, LLM-ready Markdown.

```typescript
docs.read({ url: "https://docs.stripe.com/webhooks" })
// → { content: "# Webhooks\n\n...", nav_links: [...], format: "html" }
```

### docs.search

Search for specific topics within a documentation site.

```typescript
docs.search({ query: "authentication", site: "docs.stripe.com" })
// → [ { url: ".../authentication", title: "Authentication", ... } ]
```

### docs.extract

Extract structured endpoint information from OpenAPI/Swagger/Postman specs.

```typescript
docs.extract({ url: "https://api.stripe.com/openapi/spec.json", tag: "charges" })
// → [ { method: "POST", path: "/v1/charges", summary: "Create a charge" }, ... ]
```
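The spec-walking behind `docs.extract` can be sketched as follows. The minimal types and the `extractByTag` helper are illustrative assumptions, not DocBreach's source:

```typescript
// Walk an OpenAPI document's `paths`, keep operations carrying the requested
// tag, and emit method/path/summary triples. A real spec has many more
// fields; these types cover only what this sketch touches.

interface Operation { tags?: string[]; summary?: string }
type Spec = { paths: Record<string, Record<string, Operation>> };

const HTTP_METHODS = ["get", "put", "post", "delete", "patch", "options", "head"];

function extractByTag(spec: Spec, tag: string) {
  const out: { method: string; path: string; summary: string }[] = [];
  for (const [path, ops] of Object.entries(spec.paths)) {
    for (const [method, op] of Object.entries(ops)) {
      if (!HTTP_METHODS.includes(method)) continue; // skip parameters, etc.
      if (op.tags?.includes(tag)) {
        out.push({ method: method.toUpperCase(), path, summary: op.summary ?? "" });
      }
    }
  }
  return out;
}
```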
## 📡 Beyond the MCP Specification
Google and Anthropic's official MCP best practices ask for "Single Responsibility," "Clear Descriptions," and "Structured Error Handling." That is the bare minimum.
Thanks to Vurb.ts, DocBreach pushes these concepts well beyond the baseline protocol:
- MVA Architecture (Model → View → Agent): Standard MCP returns raw JSON strings. We route everything through Fluent Presenters acting as smart egress firewalls, stripping noise before the LLM ever sees it.
- HATEOAS Navigation: Instead of the agent guessing what to do next, every DocBreach response includes a `.suggestActions()` payload telling the model exactly which tool to call next.
- JIT System Rules: Dynamic instructions injected mid-flight based on payload context (e.g., "The content was truncated, use search").
- Self-Healing Errors: Standard MCP throws an error. DocBreach returns the error plus the exact prompt/tool required to recover from it.
- Server-Side Mermaid UI: Sends native `ui.mermaid()` visual graphs to the MCP Inspector to help humans see the architecture the agent sees.
- State Sync & Cache Control: Emits `.cached()` directives at the protocol level to eliminate duplicate requests and save LLM token context.
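A HATEOAS-style response envelope of the kind described above might look like this. The field names and the `withSuggestions` helper are hypothetical, not the published schema:

```typescript
// Wrap extracted content in an envelope that carries explicit next-step
// suggestions, so the model never has to guess which tool to call next.

interface SuggestedAction {
  tool: string;
  args: Record<string, unknown>;
  reason: string;
}

interface Envelope<T> {
  data: T;
  truncated: boolean;
  suggestedActions: SuggestedAction[];
}

function withSuggestions<T>(data: T, truncated: boolean, sourceUrl: string): Envelope<T> {
  const actions: SuggestedAction[] = [];
  if (truncated) {
    // JIT rule: a truncated page should push the agent toward targeted search.
    actions.push({
      tool: "docs.search",
      args: { site: new URL(sourceUrl).hostname },
      reason: "Content was truncated; search instead of re-reading the full page.",
    });
  }
  return { data, truncated, suggestedActions: actions };
}
```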
## 📜 License
MIT β because documentation should be free, and so should the tools that read it.
Stop paying rent to read public web pages.
⭐ Star this repo if you agree.