Self-hostable MCP server that lets Claude crawl pages blocked by anti-bot walls, JS challenges, or IP bans, without paid scraping APIs. Bring your own proxies. RabbitMQ queue, Rust crawler, MinIO claim-check.
CrawMcp
Gives Claude (or any MCP client) a web-crawling capability that runs on your own server. It fetches pages blocked by anti-bot walls, JS challenges, or IP bans, without depending on paid, quota-limited scraping services like Firecrawl, Bright Data, or Browserbase. Bring your own proxy, self-host it, and content is deleted automatically by TTL. License: AGPL-3.0.
A URL-crawling MCP server. The model (Claude) cannot send requests directly to most sites. Instead it hands the URL to this MCP, and the MCP writes it to RabbitMQ. A separate real crawler (worker) picks it up from the queue, opens the page, writes the content to MinIO, and sends only a pointer to the queue. The MCP reads that pointer, fetches the content from MinIO, and returns it to the model.
Heavy page content lives in MinIO (object store), job status/metadata lives in Redis, and RabbitMQ only carries small messages. This is the classic Claim Check pattern.
Claude ──[crawl_url]───▶ MCP ──publish(url)────▶ RabbitMQ: crawl.requests
                          │                          │
                 (metadata to Redis)      crawler: opens the page
                          │                          │
                          │               content ─▶ MinIO: crawl/<jobId> (gzip)
                          │                          │
Claude ◀─[crawl_result]─ MCP ◀── pointer+meta ── RabbitMQ: crawl.results
                          │
    MCP fetches content from MinIO by storageKey
- RabbitMQ = transport pipe (messages are small, deleted once acked).
- Redis crawmcp:job:<id> = status + metadata (NO text; has TTL, shared).
- MinIO crawl/<jobId> = heavy content (gzipped, the real persistent home).
crawl_url returns a job_id right away (does not block). You fetch the content with
crawl_result using that job_id. It returns pending until the worker finishes.
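The submit-then-poll flow above can be sketched from the client side as follows. This is only an illustration: `callTool` stands in for however your MCP client invokes a tool, and the `state`/`content`/`error` field names in `CrawlResult` are assumptions, not the server's documented response shape. Since the server has no per-job timeout yet, the sketch caps the number of polls on the client.

```typescript
// Hypothetical shape of a crawl_result response; the real tool output may differ.
type CrawlResult =
  | { state: "pending" }
  | { state: "done"; content: string }
  | { state: "error"; error: string };

// Stand-in for your MCP client's tool-invocation call (an assumption, not a real SDK API).
type CallTool = (name: string, args: Record<string, unknown>) => Promise<unknown>;

interface WaitOptions {
  intervalMs?: number;
  maxAttempts?: number;
  sleep?: (ms: number) => Promise<void>;
}

async function crawlAndWait(callTool: CallTool, url: string, opts: WaitOptions = {}): Promise<string> {
  const intervalMs = opts.intervalMs ?? 1000;
  const maxAttempts = opts.maxAttempts ?? 30;
  const sleep = opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));

  // crawl_url only enqueues the job, so it returns a job_id immediately.
  const { job_id } = (await callTool("crawl_url", { url })) as { job_id: string };

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = (await callTool("crawl_result", { job_id })) as CrawlResult;
    if (result.state === "done") return result.content;
    if (result.state === "error") throw new Error(`crawl failed: ${result.error}`);
    await sleep(intervalMs); // still pending: wait, then ask again
  }
  throw new Error(`job ${job_id} still pending after ${maxAttempts} attempts`);
}
```

Injecting `sleep` keeps the loop easy to test with a fake tool caller and no real delays.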
Tools
| Tool | What it does |
|------|----------|
| crawl_url | { url } → writes to RabbitMQ, returns job_id |
| crawl_result | { job_id, max_length? } → pending / page content (from MinIO) / error |
| crawl_forget | { job_id } → deletes content from MinIO and metadata from Redis (manual/early cleanup) |
Content cleanup (deletion)
To stop content from piling up, job lifetime = content lifetime:
- In sync with TTL (automatic): when the job record in Redis expires per JOB_TTL_SECONDS, the MCP catches it via a Redis expired event and deletes the matching MinIO object (crawl/<jobId>). (The MCP sets notify-keyspace-events itself at startup.)
- Lifecycle backstop: in case an event is missed, a day-based expiry rule on the bucket (CONTENT_BACKSTOP_DAYS, default 1 day) stands as a safety net.
- Manual/early: crawl_forget(job_id) deletes the content + metadata immediately.
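As a rough sketch of the TTL-driven path, the snippet below maps an expired Redis key to the MinIO object that should be deleted. The key prefixes come from this README; the function names are hypothetical, and `deleteObject` is an injected stand-in for whatever S3-compatible delete call is used. It also assumes the content sits at the default crawl/<jobId> key, since the job record (and any custom key stored in it) is already gone by the time the event fires.

```typescript
const JOB_KEY_PREFIX = "crawmcp:job:"; // Redis key prefix used for job records
const CONTENT_PREFIX = "crawl/";       // MinIO object prefix used for page content

// Redis publishes expirations on "__keyevent@<db>__:expired" when
// notify-keyspace-events includes "Ex"; the message payload is the expired key name.
function expiredChannel(db: number): string {
  return `__keyevent@${db}__:expired`;
}

// Returns the object key to delete, or null when the expired key is not a job record.
function objectKeyForExpiredKey(redisKey: string): string | null {
  if (!redisKey.startsWith(JOB_KEY_PREFIX)) return null;
  const jobId = redisKey.slice(JOB_KEY_PREFIX.length);
  return jobId.length > 0 ? `${CONTENT_PREFIX}${jobId}` : null;
}

// Event handler with the delete call injected; resolves to true if a delete was issued.
async function onExpired(
  redisKey: string,
  deleteObject: (objectKey: string) => Promise<void>,
): Promise<boolean> {
  const objectKey = objectKeyForExpiredKey(redisKey);
  if (objectKey === null) return false;
  await deleteObject(objectKey);
  return true;
}
```

A subscriber would listen on `expiredChannel(0)` (or whichever database index is in use) and pass each message to `onExpired`.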
Setup
npm install
npm run build
Running
npm run infra:up # RabbitMQ + Redis + MinIO + worker-rs (Rust crawler), docker
npm run build # build the MCP server → dist/
# The MCP server is started by the client (Claude Code/Desktop); standalone too: npm start
# npm run worker:example → pure-JS reference worker (contract/local test; production = worker-rs)
UIs: RabbitMQ → http://localhost:15672 (guest/guest), MinIO → http://localhost:9001 (minioadmin/minioadmin). To stop: npm run infra:down.
Adding it to Claude Code
claude mcp add crawmcp -- node <abs>/dist/index.js # <abs> = absolute path of this repo
Message contract (between crawler and MCP)
This is all you need to know when writing your own crawler. Heavy text does not go in the message. The crawler writes content to MinIO and sends a pointer to the queue.
Request, MCP → crawl.requests:
{ "jobId": "crawl_1_...", "url": "https://...", "requestedAt": 1730000000000, "storageKey": "crawl/crawl_1_..." }
Result, crawler → crawl.results (first write the content gzipped to storageKey, then):
{
"jobId": "crawl_1_...",
"result": {
"ok": true,
"status": 200,
"contentType": "text/html",
"size": 528,
"storageKey": "crawl/crawl_1_...",
"encoding": "gzip"
}
}
On error: { "jobId": "...", "result": { "ok": false, "error": "..." } }
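For a crawler or consumer written in TypeScript, the two message shapes above could be typed and checked roughly like this. The field names are taken from the examples in this section; which fields are strictly required is an assumption (only jobId, ok, storageKey and error are checked here), and `parseResultMessage` is a hypothetical helper, not part of the MCP.

```typescript
interface CrawlRequest {
  jobId: string;
  url: string;
  requestedAt: number; // epoch milliseconds
  storageKey: string;
}

interface CrawlSuccess {
  ok: true;
  status: number;
  contentType: string;
  size: number;
  storageKey: string;
  encoding?: string; // assumed optional; "gzip" in the example above
}

interface CrawlFailure {
  ok: false;
  error: string;
}

interface CrawlResultMessage {
  jobId: string;
  result: CrawlSuccess | CrawlFailure;
}

// Parses a crawl.results payload and rejects messages that break the contract.
function parseResultMessage(raw: string): CrawlResultMessage {
  const msg: unknown = JSON.parse(raw);
  if (typeof msg !== "object" || msg === null) throw new Error("message is not an object");
  const { jobId, result } = msg as { jobId?: unknown; result?: unknown };
  if (typeof jobId !== "string") throw new Error("jobId must be a string");
  if (typeof result !== "object" || result === null) throw new Error("result must be an object");

  const r = result as Record<string, unknown>;
  if (r["ok"] === true) {
    if (typeof r["storageKey"] !== "string") throw new Error("successful result needs a storageKey");
    return { jobId, result: r as unknown as CrawlSuccess };
  }
  if (r["ok"] === false && typeof r["error"] === "string") {
    return { jobId, result: { ok: false, error: r["error"] } };
  }
  throw new Error("result.ok must be true, or false with an error string");
}
```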
The MCP suggests the storageKey (crawl/<jobId>). If the crawler uses a different key, it just needs to report the key it wrote in the result. The MCP reads the content with that key. If encoding: "gzip", the MCP decompresses it.
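The read side described above might look roughly like the following, using Node's built-in zlib. Both helper names are hypothetical, and treating max_length as a simple character-count truncation is an assumption about how the tool behaves.

```typescript
import { gunzipSync, gzipSync } from "node:zlib";

// Prefer the key the crawler reported; fall back to the suggested crawl/<jobId>.
function resolveStorageKey(jobId: string, reportedKey?: string): string {
  return reportedKey ?? `crawl/${jobId}`;
}

// Turns the stored object body back into text, decompressing only when told to.
function decodeContent(body: Uint8Array, encoding?: string, maxLength?: number): string {
  const bytes = encoding === "gzip" ? gunzipSync(body) : Buffer.from(body);
  const text = bytes.toString("utf8");
  // Assumption: max_length caps the number of characters returned to the model.
  return maxLength !== undefined && text.length > maxLength ? text.slice(0, maxLength) : text;
}

// Round trip: what a crawler would upload, and what the reader would hand back.
const uploaded = gzipSync("<html>hello</html>");
const preview = decodeContent(uploaded, "gzip", 6);
console.log(resolveStorageKey("crawl_1_x"), preview);
```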
Crawler / worker
Production crawler: worker-rs. A Rust multi-strategy scraper
(TLS-fingerprint HTTP strategies + a real-Chromium --dump-dom BrowserFetch fallback,
JS-challenge solver, optional proxy rotation). It runs as the worker-rs service in
docker-compose.yml (npm run infra:up starts it).
worker/example-worker.mjs is a minimal pure-JS reference worker.
It demonstrates the message contract (fetch → gzip → MinIO → pointer) and is meant for local testing (npm run worker:example). Any worker that follows the contract above works with this MCP without
changes.
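To make the fetch → gzip → MinIO → pointer sequence concrete, here is a minimal sketch of a contract-following worker step. It is not the code in worker/example-worker.mjs: the three injected functions in `WorkerDeps` are hypothetical stand-ins for an HTTP client, an S3-compatible upload, and an AMQP publish, and reporting `size` as the uncompressed byte length is a guess.

```typescript
import { gzipSync } from "node:zlib";

interface CrawlRequest {
  jobId: string;
  url: string;
  requestedAt: number;
  storageKey: string;
}

// Injected dependencies (all hypothetical): plug in real clients as needed.
interface WorkerDeps {
  fetchPage: (url: string) => Promise<{ status: number; contentType: string; body: string }>;
  putObject: (key: string, body: Uint8Array) => Promise<void>;
  publishResult: (message: string) => Promise<void>;
}

// Fetches the page and stores it gzipped; returns the `result` part of the message.
async function crawlToStore(req: CrawlRequest, deps: WorkerDeps): Promise<Record<string, unknown>> {
  try {
    const page = await deps.fetchPage(req.url);
    await deps.putObject(req.storageKey, gzipSync(page.body));
    return {
      ok: true,
      status: page.status,
      contentType: page.contentType,
      // Assumption: size is the uncompressed byte length; the contract does not say.
      size: Buffer.byteLength(page.body, "utf8"),
      storageKey: req.storageKey,
      encoding: "gzip",
    };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}

// Content is written first; only then does the small pointer message go to crawl.results.
async function handleRequest(req: CrawlRequest, deps: WorkerDeps): Promise<void> {
  const result = await crawlToStore(req, deps);
  await deps.publishResult(JSON.stringify({ jobId: req.jobId, result }));
}
```

Publishing once, after the store step has either succeeded or failed, means every request produces exactly one result message.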
Environment variables
| Variable | Default |
|----------|-----------|
| RABBITMQ_URL | amqp://guest:guest@localhost:5672 |
| REDIS_URL | redis://localhost:6379 |
| CRAWL_REQUEST_QUEUE | crawl.requests |
| CRAWL_RESULT_QUEUE | crawl.results |
| JOB_TTL_SECONDS | 3600 (lifetime of metadata in Redis) |
| MINIO_ENDPOINT | http://localhost:9000 |
| MINIO_ACCESS_KEY / MINIO_SECRET_KEY | minioadmin / minioadmin |
| MINIO_BUCKET | crawl-content |
| S3_REGION | us-east-1 |
| CONTENT_BACKSTOP_DAYS | 1 (lifecycle safety net; the real deletion is by TTL) |
Because it uses the AWS SDK, you can switch to Cloudflare R2 / AWS S3 with the same code by changing MINIO_ENDPOINT and the keys.
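One way to read the table above is as a config loader with those defaults. The sketch below is illustrative only: the `CrawMcpConfig` shape and `loadConfig` are hypothetical names, and the positive-number validation is an assumption, not documented behaviour of the server.

```typescript
interface CrawMcpConfig {
  rabbitmqUrl: string;
  redisUrl: string;
  requestQueue: string;
  resultQueue: string;
  jobTtlSeconds: number;
  contentBackstopDays: number;
  storage: { endpoint: string; accessKey: string; secretKey: string; bucket: string; region: string };
}

// Builds a config from an environment map, falling back to the defaults in the table.
function loadConfig(env: Record<string, string | undefined>): CrawMcpConfig {
  const str = (name: string, fallback: string): string => env[name] ?? fallback;
  const num = (name: string, fallback: number): number => {
    const raw = env[name];
    if (raw === undefined || raw === "") return fallback;
    const value = Number(raw);
    if (!Number.isFinite(value) || value <= 0) {
      throw new Error(`${name} must be a positive number, got "${raw}"`);
    }
    return value;
  };

  return {
    rabbitmqUrl: str("RABBITMQ_URL", "amqp://guest:guest@localhost:5672"),
    redisUrl: str("REDIS_URL", "redis://localhost:6379"),
    requestQueue: str("CRAWL_REQUEST_QUEUE", "crawl.requests"),
    resultQueue: str("CRAWL_RESULT_QUEUE", "crawl.results"),
    jobTtlSeconds: num("JOB_TTL_SECONDS", 3600),
    contentBackstopDays: num("CONTENT_BACKSTOP_DAYS", 1),
    storage: {
      endpoint: str("MINIO_ENDPOINT", "http://localhost:9000"),
      accessKey: str("MINIO_ACCESS_KEY", "minioadmin"),
      secretKey: str("MINIO_SECRET_KEY", "minioadmin"),
      bucket: str("MINIO_BUCKET", "crawl-content"),
      region: str("S3_REGION", "us-east-1"),
    },
  };
}
```

In a Node process this would be called as `loadConfig(process.env)`. Pointing at another S3-compatible store then only means setting MINIO_ENDPOINT, MINIO_ACCESS_KEY and MINIO_SECRET_KEY (and S3_REGION if the provider requires it).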
Notes
- Data placement: content → MinIO (gzip), metadata/pointer → Redis, RabbitMQ stays light. This way heavy HTML neither bloats Redis RAM nor strains the queue.
- Multiple MCP instances can run because job status is shared in Redis.
- Cleanup: content is deleted in sync with the job TTL (Redis expired-event → MinIO delete) + day-based lifecycle backstop + manual crawl_forget. See "Content cleanup".
- Note: Redis expired events are not 100% guaranteed delivery (they can be missed if the listener is down at that moment). The backstop exists for exactly this.
- Not yet present: retry, per-job timeout, dead-letter queue (DLQ), returning extracted text/markdown instead of raw content. These will be added when needed.
License
AGPL-3.0-only, see LICENSE. Use, modify, and self-host it freely. The one condition: if you offer this to others as a network service (hosted SaaS), you must also share the source of your changes under AGPL.
worker-rs statically links the GPL-3.0 licensed wreq-util crate, so it is already under
copyleft. AGPL-3.0 is compatible with it and covers the whole project under one license.
Contributing
See CONTRIBUTING.md and, for architecture notes, docs/ARCHITECTURE.md.